Chapter 5 Pandas
A powerful library for data manipulation and analysis, borrowing the idea of data frames from the R language. An essential import setup for Pandas is:
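import pandas as pd
import numpy as np   # used alongside pandas (e.g. np.nan) throughout this chapter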
5.1 Series
Similar to a one-dimensional NumPy array, with the addition of an index which can comprise arbitrary values:
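s = pd.Series([2,5,6,7])
print(s)
#> 0    2
#> 1    5
#> 2    6
#> 3    7
#> dtype: int64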
In the example above, the index has not been specified, and it defaults to the standard 0-based integer index.
s = pd.Series([2,5,6,7],index=("prime1","prime2","prime3","prime4"))
print(s)
#> prime1 2
#> prime2 5
#> prime3 6
#> prime4 7
#> dtype: int64
Indexes can then be used for slicing; note that, unlike integer-based slicing, label-based slicing includes both endpoints:
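print(s['prime2':'prime4'])
#> prime2    5
#> prime3    6
#> prime4    7
#> dtype: int64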
When a Series is instantiated from a dictionary, the keys are used to create the index:
s = pd.Series({'Friday':0.2,'Saturday':1.0,'Sunday':4.2,'Monday':0.0,'Tuesday':1.2})
print(s[1:3])   # integer-based (positional) slicing
#> Saturday 1.0
#> Sunday 4.2
#> dtype: float64
The overall behaviour of Series is similar to that of ndarray, as far as slicing and indexing are concerned:
def fahrenheit(t):
    return 9*t/5 + 32
s[s<1.2]
#> Friday 0.2
#> Saturday 1.0
#> Monday 0.0
#> dtype: float64
fahrenheit(s)
#> Friday 32.36
#> Saturday 33.80
#> Sunday 39.56
#> Monday 32.00
#> Tuesday 34.16
#> dtype: float64
Series has some features of dict, like assignment by index and testing of index-value membership:
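For example, with the Series s defined above:

'Sunday' in s
#> True
s['Wednesday'] = 0.0   # assigning to a new index label adds an entry

A DatetimeIndex, built for instance with pd.date_range(), is commonly used as the index of a time series: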
dt=pd.date_range('2019-10-29','2019-11-02')
print(dt)
#> DatetimeIndex(['2019-10-29', '2019-10-30', '2019-10-31', '2019-11-01',
#> '2019-11-02'],
#> dtype='datetime64[ns]', freq='D')
precip_sp = pd.Series([0.2,1.0,4.2,0.0,1.2],index=dt)
print(precip_sp)
#> 2019-10-29 0.2
#> 2019-10-30 1.0
#> 2019-10-31 4.2
#> 2019-11-01 0.0
#> 2019-11-02 1.2
#> Freq: D, dtype: float64
precip_bu = pd.Series([0.0,0.0,0.0,1.2,4.2,0.0], index=pd.date_range('2019-10-27','2019-11-01'))
# NaN appears where the indexes of precip_sp and precip_bu do not overlap
print(precip_sp-precip_bu)
#> 2019-10-27 NaN
#> 2019-10-28 NaN
#> 2019-10-29 0.2
#> 2019-10-30 -0.2
#> 2019-10-31 0.0
#> 2019-11-01 0.0
#> 2019-11-02 NaN
#> Freq: D, dtype: float64
5.2 DataFrame
A 2-dimensional, table-like data structure with columns of possibly different types and an index for its rows.
You can create a DataFrame by passing a dictionary of Series objects:
people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "martino", "franco"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["alice", "martino", "franco"], name="year"),
    "children": pd.Series([0, 3], index=["martino", "franco"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "martino"]),
}
people = pd.DataFrame(people_dict)
people
#> weight birthyear children hobby
#> alice 68 1984 NaN Biking
#> franco 112 1992 3.0 NaN
#> martino 83 1985 0.0 Dancing
Another convenient way to create a DataFrame is to pass all the values to the constructor as an ndarray, or a list of lists, and specify the column names and row index labels separately:
values = [
    [1985, np.nan, "Biking", 68],
    [1984, 3, "Dancing", 83],
    [1992, 0, np.nan, 112]
]
d = pd.DataFrame(
    values,
    columns=["birthyear", "children", "hobby", "weight"],
    index=["alice", "martino", "franco"])
d
#> birthyear children hobby weight
#> alice 1985 NaN Biking 68
#> martino 1984 3.0 Dancing 83
#> franco 1992 0.0 NaN 112
precips = pd.DataFrame({'San Pietro Capofiume': precip_sp,'Bologna Urbana': precip_bu})
print(precips)
#> San Pietro Capofiume Bologna Urbana
#> 2019-10-27 NaN 0.0
#> 2019-10-28 NaN 0.0
#> 2019-10-29 0.2 0.0
#> 2019-10-30 1.0 1.2
#> 2019-10-31 4.2 4.2
#> 2019-11-01 0.0 0.0
#> 2019-11-02 1.2 NaN
The pandas DataFrame() function is flexible as to the types of its inputs. When given a list of dictionaries, each dictionary of the list becomes one row of the data frame:
city_loc = pd.DataFrame([{'Lat':44.49381,'Long':11.33875},
                         {'Lat':41.89193,'Long':12.51133}],
                        index=['Bologna','Rome'])
print(city_loc)
#> Lat Long
#> Bologna 44.49381 11.33875
#> Rome 41.89193 12.51133
A DataFrame behaves like a dictionary, in which the keys are column names and the values are index-aligned Series:
# add an elevation column, for example from a list
city_loc['Elev'] = [54,21]
city_loc
#> Lat Long Elev
#> Bologna 44.49381 11.33875 54
#> Rome 41.89193 12.51133 21
5.2.1 Indexing
5.2.1.1 The .loc[] attribute
Label-based indexing is supported via the .loc[] attribute; when a single label is selected, the result is a Series object:
precips.loc['2019-10-30']
#> San Pietro Capofiume 1.0
#> Bologna Urbana 1.2
#> Name: 2019-10-30 00:00:00, dtype: float64
type(precips.loc['2019-10-30'])
#> <class 'pandas.core.series.Series'>
.loc[] supports slicing. However, unlike ndarray integer-based slicing, it includes the upper label:
precips.loc['2019-10-29':'2019-11-01']
#> San Pietro Capofiume Bologna Urbana
#> 2019-10-29 0.2 0.0
#> 2019-10-30 1.0 1.2
#> 2019-10-31 4.2 4.2
#> 2019-11-01 0.0 0.0
and it also accepts a list of labels:
# Note: the list has to be converted to a DatetimeIndex, because the Series
# used to build the data frame were constructed with a DatetimeIndex
precips.loc[pd.DatetimeIndex(['2019-10-31','2019-11-01','2019-11-02'])]
#> San Pietro Capofiume Bologna Urbana
#> 2019-10-31 4.2 4.2
#> 2019-11-01 0.0 0.0
#> 2019-11-02 1.2 NaN
It is also possible to index with booleans:
# boolean mask marking the rows that contain missing values
na=(pd.isna(precips['Bologna Urbana'])|pd.isna(precips['San Pietro Capofiume']))
# keep the rows where no value is missing (~ negates the mask)
precips.loc[~na]
#> San Pietro Capofiume Bologna Urbana
#> 2019-10-29 0.2 0.0
#> 2019-10-30 1.0 1.2
#> 2019-10-31 4.2 4.2
#> 2019-11-01 0.0 0.0
5.2.1.2 The .iloc[] attribute
The .iloc[] attribute provides integer-based indexing, following the usual Python conventions. The allowed types of input are the same as for .loc[], but integers must be specified instead of labels. The following examples return the same results as the ones above:
precips.iloc[3]
#> San Pietro Capofiume 1.0
#> Bologna Urbana 1.2
#> Name: 2019-10-30 00:00:00, dtype: float64
precips.iloc[2:6]
#> San Pietro Capofiume Bologna Urbana
#> 2019-10-29 0.2 0.0
#> 2019-10-30 1.0 1.2
#> 2019-10-31 4.2 4.2
#> 2019-11-01 0.0 0.0
precips.iloc[[4,5,6]]
#> San Pietro Capofiume Bologna Urbana
#> 2019-10-31 4.2 4.2
#> 2019-11-01 0.0 0.0
#> 2019-11-02 1.2 NaN
It is also possible to index with booleans; however, the boolean Series must first be converted to a NumPy array with .to_numpy().
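For example, reusing the na mask defined above:

precips.iloc[(~na).to_numpy()]   # same rows as precips.loc[~na]
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-29                   0.2             0.0
#> 2019-10-30                   1.0             1.2
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0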
5.2.1.3 .query()
The .query() method lets you filter a DataFrame based on a query expression; it is the best choice when several conditions have to be combined in the row selection.
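For example, a minimal sketch using the people data frame defined earlier:

people.query("weight > 70 and children > 0")
#>         weight  birthyear  children hobby
#> franco     112       1992       3.0   NaN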
5.2.2 Reading Data
5.2.2.1 read_csv
# iris1 = pd.read_csv("https://raw.githubusercontent.com/SakshamAyush/Iris_Flower_Classification/master/data/bezdekIris.csv")
- The compression argument specifies the decompression algorithm when the dataset is a compressed CSV file; the value 'infer' chooses it from the file extension.
- By default, time series are not recognized: the dtype of such columns is object, while it should be datetime64[ns]. The parse_dates argument specifies a list of columns that must be parsed as dates (e.g. parse_dates = ['year'] or parse_dates = [0]).
- The index_col argument sets a column (or a sequence of columns) as row labels (e.g. index_col='Time').
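A minimal sketch combining these arguments; the file name and the 'Time' column are hypothetical:

df = pd.read_csv('data.csv.gz',        # hypothetical compressed CSV file
                 compression='infer',  # decompression chosen by file extension
                 parse_dates=['Time'], # parse the 'Time' column as dates
                 index_col='Time')     # use it as the row index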
5.2.2.2 read_excel
The read_excel function enables the reading of .xls and .xlsx files. Installing either xlrd (for .xls) or openpyxl (for Excel 2007+ .xlsx files) is a prerequisite.
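A minimal sketch; the file name and sheet name are hypothetical:

# requires openpyxl for .xlsx files; 'grades.xlsx' and 'Sheet1' are hypothetical
df = pd.read_excel('grades.xlsx', sheet_name='Sheet1', index_col=0)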
5.2.3 Data Wrangling
The pd.concat function, applied to a list of objects, can concatenate Series or DataFrame objects:
# Row labels are not modified
s1=pd.Series([1,2,3])
s2=pd.Series([4,5,6])
s=pd.concat([s1, s2])
s
#> 0 1
#> 1 2
#> 2 3
#> 0 4
#> 1 5
#> 2 6
#> dtype: int64
Be careful when concatenating DataFrame objects. Let C1 and C2 be the column labels of the first and second frame: the columns in C2 but not in C1 are added to the first frame, and the columns in C1 but not in C2 to the second, all filled with NaN. The rows of the second frame are then appended to the first:
df1=pd.DataFrame({'a': [1,2,3],'b': [4,5,6]})
df2=pd.DataFrame({'b': [7,8,9],'c': [10,11,12]})
df=pd.concat([df1, df2])
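The rows of the second frame keep their labels, and the missing columns are filled with NaN:

df
#>      a  b     c
#> 0  1.0  4   NaN
#> 1  2.0  5   NaN
#> 2  3.0  6   NaN
#> 0  NaN  7  10.0
#> 1  NaN  8  11.0
#> 2  NaN  9  12.0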
5.2.3.1 Multi-indexing
If all column labels are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:
d = pd.DataFrame(
    {
        ("public", "birthyear"):
            {("Paris","alice"): 1985, ("Paris","martino"): 1984, ("London","franco"): 1992},
        ("public", "hobby"):
            {("Paris","alice"): "Biking", ("Paris","martino"): "Dancing"},
        ("private", "weight"):
            {("Paris","alice"): 68, ("Paris","martino"): 83, ("London","franco"): 112},
        ("private", "children"):
            {("Paris","alice"): np.nan, ("Paris","martino"): 3, ("London","franco"): 0}
    }
)
d
#> public private
#> birthyear hobby weight children
#> Paris alice 1985 Biking 68 NaN
#> martino 1984 Dancing 83 3.0
#> London franco 1992 NaN 112 0.0
You can transpose columns and indices using the .T attribute:
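d.T
#>                     Paris          London
#>                     alice  martino  franco
#> public  birthyear    1985     1984    1992
#>         hobby       Biking  Dancing    NaN
#> private weight         68       83     112
#>         children      NaN      3.0     0.0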
There are two levels of columns and two levels of indices. We can drop a column level by calling droplevel() (the same goes for indices).
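A minimal sketch; note that dT, used in the stacking examples below, is taken here to be the transpose of d with the city level dropped from its row index:

d.droplevel(level=0, axis=1)   # drop the top column level (public/private)
#>                birthyear    hobby  weight  children
#> Paris  alice        1985   Biking      68       NaN
#>        martino      1984  Dancing      83       3.0
#> London franco       1992      NaN     112       0.0
dT = d.droplevel(level=0, axis=0).T   # drop the city row level, then transpose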
5.2.3.2 Stacking and unstacking levels
Calling the stack() method will push the lowest column level after the lowest index:
d.stack()
#> private public
#> Paris alice birthyear NaN 1985
#> hobby NaN Biking
#> weight 68.0 NaN
#> martino birthyear NaN 1984
#> children 3.0 NaN
#> hobby NaN Dancing
#> weight 83.0 NaN
#> London franco birthyear NaN 1992
#> children 0.0 NaN
#> weight 112.0 NaN
dT.stack()
#> public birthyear alice 1985
#> martino 1984
#> franco 1992
#> hobby alice Biking
#> martino Dancing
#> private weight alice 68
#> martino 83
#> franco 112
#> children martino 3.0
#> franco 0.0
#> dtype: object
Note that many NaN values appeared. This makes sense, because many of the new combinations did not exist before (e.g. there was no private birthyear).
Calling unstack() will do the reverse, once again creating many NaN values.
dT.unstack()
#> alice ... franco
#> birthyear children hobby weight ... birthyear children hobby weight
#> private NaN NaN NaN 68 ... NaN 0.0 NaN 112
#> public 1985 NaN Biking NaN ... 1992 NaN NaN NaN
#>
#> [2 rows x 12 columns]
The stack() and unstack() methods let you select the level to stack/unstack, and you can even stack/unstack multiple levels at once by specifying the level parameter.
5.2.3.3 Adding or removing columns
people
#> weight birthyear children hobby
#> alice 68 1984 NaN Biking
#> franco 112 1992 3.0 NaN
#> martino 83 1985 0.0 Dancing
people["age"] = 2018 - people["birthyear"] # adds a new column "age"
people["over 30"] = people["age"] > 30 # adds another column "over 30"
birthyears = people.pop("birthyear")
del people["children"]
people
#> weight hobby age over 30
#> alice 68 Biking 34 True
#> franco 112 NaN 26 False
#> martino 83 Dancing 33 True
When you add a new column from a list or array, it must have the same number of rows as the DataFrame; when you add a Series, it is aligned on the index, so missing rows are filled with NaN and extra rows are ignored.
When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the .insert() method:
people.insert(1, "height", [172, 181, 185])
(people
.assign(body_mass_index = lambda df: df["weight"] / (df["height"] / 100)** 2)
.assign(overweight = lambda df: df["body_mass_index"] > 25)
)
#> weight height hobby age over 30 body_mass_index overweight
#> alice 68 172 Biking 34 True 22.985398 False
#> franco 112 181 NaN 26 False 34.186991 True
#> martino 83 185 Dancing 33 True 24.251278 False
5.2.3.4 Sorting a DataFrame
You can sort a DataFrame by calling its sort_index method. By default it sorts the rows by their index label, in ascending order, but let’s reverse the order:
people.sort_index(ascending=False)
#> weight height hobby age over 30
#> martino 83 185 Dancing 33 True
#> franco 112 181 NaN 26 False
#> alice 68 172 Biking 34 True
Note that sort_index returned a sorted copy of the DataFrame. To modify people directly, we can set the inplace argument to True. We can also sort the columns instead of the rows by setting axis=1, or sort by the values instead of the labels using sort_values() and specifying the column to sort by:
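For example, sorting people by age, in place:

people.sort_values(by="age", inplace=True)
people
#>          weight  height    hobby  age  over 30
#> franco      112     181      NaN   26    False
#> martino      83     185  Dancing   33     True
#> alice        68     172   Biking   34     True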
5.2.3.5 Aggregating
Similar to the SQL language, pandas allows grouping your data into groups, with the .groupby() method, to run calculations over each group. Pandas also supports spreadsheet-like pivot tables, via .pivot_table(), that allow quick data summarization.
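To illustrate both, here is a minimal sketch with hypothetical precipitation data:

rain = pd.DataFrame({
    "city":   ["Bologna", "Bologna", "Rome", "Rome"],   # hypothetical data
    "year":   [2018, 2019, 2018, 2019],
    "precip": [650.0, 720.0, 800.0, 760.0],
})
rain.groupby("city")["precip"].mean()   # mean precipitation per city
#> city
#> Bologna    685.0
#> Rome       780.0
#> Name: precip, dtype: float64
rain.pivot_table(index="city", columns="year", values="precip")
#> year      2018   2019
#> city
#> Bologna  650.0  720.0
#> Rome     800.0  760.0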
5.2.4 Evaluating an expression
A great feature supported by pandas is expression evaluation. This relies on the numexpr library, which must be installed:
people.eval("weight / (height/100) ** 2 > 25")
#> franco True
#> martino False
#> alice False
#> dtype: bool
Assignment expressions are also supported. Let’s set inplace=True to directly modify the DataFrame rather than getting a modified copy:
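For example, adding the body mass index as a new column:

people.eval("body_mass_index = weight / (height/100) ** 2", inplace=True)
people
#>          weight  height    hobby  age  over 30  body_mass_index
#> franco      112     181      NaN   26    False        34.186991
#> martino      83     185  Dancing   33     True        24.251278
#> alice        68     172   Biking   34     True        22.985398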
5.2.5 Handling Missing Data
Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.
.isnull() and .isna() indicate whether values are missing (True/False). Summing the result column-wise (.isnull().sum(0)) gives an overview of the number of NAs in each column, while df[df.isnull().any(axis=1)] keeps only the rows that contain at least one null.
grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])
grades = pd.DataFrame(grades_array, columns=["sep", "oct", "nov"], index=["alice","bob","charles","darwin"])
grades
#> sep oct nov
#> alice 8 8 9
#> bob 10 9 9
#> charles 4 8 2
#> darwin 9 10 10
bonus_array = np.array([[0,np.nan,2],[np.nan,1,0],[0, 1, 0], [3, 3, 0]])
bonus_points = pd.DataFrame(bonus_array, columns=["oct", "nov", "dec"], index=["bob","colin", "darwin", "charles"])
bonus_points
#> oct nov dec
#> bob 0.0 NaN 2.0
#> colin NaN 1.0 0.0
#> darwin 0.0 1.0 0.0
#> charles 3.0 3.0 0.0
grades + bonus_points
#> dec nov oct sep
#> alice NaN NaN NaN NaN
#> bob NaN NaN 9.0 NaN
#> charles NaN 5.0 11.0 NaN
#> colin NaN NaN NaN NaN
#> darwin NaN 11.0 10.0 NaN
Looks like the addition worked in some cases but way too many elements are now empty. That’s because when aligning the DataFrames, some columns and rows were only present on one side, and thus they were considered missing on the other side (NaN). Then adding NaN to a number results in NaN, hence the result.
5.2.5.1 .fillna()
Let’s try to fix the problem above. For example, we can decide that missing data should result in a zero instead of NaN. We can replace all NaN values with any value using the fillna() method:
(grades + bonus_points).fillna(0) #fill with 0s
#> dec nov oct sep
#> alice 0.0 0.0 0.0 0.0
#> bob 0.0 0.0 9.0 0.0
#> charles 0.0 5.0 11.0 0.0
#> colin 0.0 0.0 0.0 0.0
#> darwin 0.0 11.0 10.0 0.0
It’s a bit unfair that we’re setting grades to zero in September, though. Perhaps we should decide that missing grades are missing grades, but missing bonus points should be replaced by zeros:
better_bonus_points = bonus_points.copy()
better_bonus_points.insert(0, "sep", 0)   # no bonus points in September
better_bonus_points.loc["alice"] = 0      # alice got no bonus at all
better_bonus_points = better_bonus_points.interpolate(axis=1)  # fill remaining NaNs row-wise
final_grades = grades + better_bonus_points
final_grades
#> dec nov oct sep
#> alice NaN 9.0 8.0 8.0
#> bob NaN 10.0 9.0 10.0
#> charles NaN 5.0 11.0 4.0
#> colin NaN NaN NaN NaN
#> darwin NaN 11.0 10.0 9.0
5.2.5.2 .dropna()
So let’s call the .dropna() method to get rid of rows that are full of NaNs:
final_grades_clean = final_grades.dropna(how="all")
final_grades_clean
#> dec nov oct sep
#> alice NaN 9.0 8.0 8.0
#> bob NaN 10.0 9.0 10.0
#> charles NaN 5.0 11.0 4.0
#> darwin NaN 11.0 10.0 9.0
To remove columns that are full of NaNs, we set the axis argument to 1:
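For example, dropping the dec column, which contains only NaNs:

final_grades_clean = final_grades_clean.dropna(axis=1, how="all")
final_grades_clean
#>           nov   oct   sep
#> alice     9.0   8.0   8.0
#> bob      10.0   9.0  10.0
#> charles   5.0  11.0   4.0
#> darwin   11.0  10.0   9.0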
5.2.6 Data Manipulation
- pd.to_numeric() converts the values of a column to a numeric dtype.
- .astype(str) converts the values to another type, for example str.
- .groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True): a groupby operation involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and compute operations on these groups.
- Use the .cut() function to create a vector with categories labeled according to some values. For example, suppose we have a variable 'median_income' and we want to split it into 5 categories, in a variable 'income_cat', as shown in the sketch below.
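A minimal sketch, with a hypothetical housing data frame:

housing = pd.DataFrame({"median_income": [0.5, 1.8, 3.2, 4.9, 7.5, 11.0]})  # hypothetical data
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"]
#> 0    1
#> 1    2
#> 2    3
#> 3    4
#> 4    5
#> 5    5
#> Name: income_cat, dtype: category
#> Categories (5, int64): [1 < 2 < 3 < 4 < 5]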
5.2.7 Overview Functions
The .info() method prints out a summary of each column's contents. The .describe() method gives a nice overview of the main aggregated values over each column:
* count: number of non-null (not NaN) values
* mean: mean of non-null values
* std: standard deviation of non-null values
* min: minimum of non-null values
* 25%, 50%, 75%: 25th, 50th and 75th percentiles of non-null values
* max: maximum of non-null values
To inspect the elements themselves, the .unique() function reports the distinct values, while .nunique() counts them. For object data there may exist multiple categories (factors); explore them with .value_counts(). The .shape attribute reports the shape of the table (rows x columns), and .corr() computes the pairwise correlation of columns, excluding NA/null values.
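For example, using the grades data frame from the previous section:

grades.describe()
#>              sep        oct        nov
#> count   4.000000   4.000000   4.000000
#> mean    7.750000   8.750000   7.500000
#> std     2.629956   0.957427   3.696846
#> min     4.000000   8.000000   2.000000
#> 25%     7.000000   8.000000   7.250000
#> 50%     8.500000   8.500000   9.000000
#> 75%     9.250000   9.250000   9.250000
#> max    10.000000  10.000000  10.000000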