This is the third post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:
This is an introduction to the NumPy and Pandas libraries that form the foundation of data science in Python. These libraries, especially Pandas, have a large API surface and many powerful features. There is now way in a short amount of time to cover every topic; in many cases we will just scratch the surface. But after this you should understand the fundamentals, have an idea of the overall scope, and have some pointers for extending your learning as you need more functionality.
Introduction¶
We'll start by importing the numpy and pandas packages. Note the "as" aliases; it is conventional to use "np" for numpy and "pd" for pandas. If you are using Anaconda Python distribution, as recommended for data science, these packages should already be available:
import numpy as np
import pandas as pd
We are going to do some plotting with the matplotlib and Seaborn packages. We want the plots to appear as cell outputs inline in Jupyter. To do that we need to run this next line:
%matplotlib inline
We're going to use the Seaborn library for better styled charts, and it may not yet be installed. To install it, if you are running at the command line and using Anaconda, use:
conda config add channels condaforge
conda install seaborn
Else use pip
:
pip install seaborn
If you are running this in Jupyter from an Anaconda installation, use:
# sys.executable is the path to the Python executable; e.g. /usr/bin/python
import sys
!conda config add channels condaforge
!conda install yes prefix {sys.prefix} seaborn
We need to import the plotting packages. We're also going to change the default style for matplotlib plots to use Seaborn's styling:
import matplotlib.pyplot as plt
import seaborn as sns
# Call sns.set() to change the default styles for matplotlib to use Seaborn styles.
sns.set()
NumPy  the Foundation of Data Science in Python¶
Data science is largely about the manipulation of (often large) collections of numbers. To support effective data science a language needs a way to do this efficiently. Python lists are suboptimal because they are heterogeneous collections of object references; the objects in turn have reference counts for garbage collection, type info, size info, and the actual data. Thus storing (say) a list of a four 32bit integers, rather than requiring just 16 bytes requires much more. Furthermore there is typically poor locality of the items referenced from the list, leading to cache misses and other performance problems. Python does offer an array
type which is homogeneous and improves on lists as far as storage goes, but it offers limited operations on that data.
NumPy bridges the gap, offering both efficient storage of homogeneous data in single or multidimensional arrays, and a rich set of computationally efficient operations on that data.
In this section we will cover some of the basics of NumPy. We won't go into too much detail as our main focus will be Pandas, a library built on top of NumPy that is particularly wellsuited to manipulating tabular data. You can get a deeper intro to NumPy here: https://docs.scipy.org/doc/numpydev/user/quickstart.html
# Create a onedimensional NumPy array from a range
a = np.arange(1, 11)
a
# Create a onedimensional NumPy array from a range with a specified increment
a = np.arange(0.5, 10.5, 0.5)
a
# Reshape the array into a 4x5 matrix
a = a.reshape(4, 5)
a
# Get the shape and # of elements
print(np.shape(a))
print(np.size(a))
# Create one dimensional NumPy array from a list
a = np.array([1, 2, 3])
a
# Append a value
b = a
a = np.append(a, 4) # Note that this makes a copy; the original array is not affected
print(b)
print(a)
# Index and slice
print(f'Second element of a is {a[1]}')
print(f'Last element of a is {a[1]}')
print(f'Middle two elements of a are {a[1:3]}')
# Create an array of zeros of length n
np.zeros(5)
# Create an array of 1s
np.ones(5)
# Create an array of 10 random integers between 1 and 100
np.random.randint(1,100, 10)
# Create linearly spaced array of 5 values from 0 to 100
np.linspace(0, 100, 5)
# Create a 2D array from a list of lists
b = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
b
# Get the shape, # of elements, and # of dimensions
print(np.shape(b))
print(np.size(b))
print(np.ndim(b))
# Get the first row of b; these are equivalent
print(b[0])
print(b[0,:]) # First row, "all columns"
# Get the first column of b
print(b[:,0])
# Get a subsection of b, from 1,1 through 2,2 (i.e. before 3,3)
print(b[1:3,1:3])
Numpy supports Boolean operations on arrays and using arrays of Boolean values to select elements:
# Get an array of Booleans based on whether entries are odd or even numbers
b%2 == 0
# Use Boolean indexing to set all even values to 1
b[b%2 == 0] = 1
b
UFuncs¶
NumPy supports highly efficient lowlevel operations on arrays called UFuncs (Universal Functions).
np.mean(b) # Get the mean of all the elements
np.power(b, 2) # Raise every element to second power
You can get the details on UFuncs here: https://docs.scipy.org/doc/numpy1.13.0/reference/ufuncs.html
Dates and Times in NumPy¶
NumPy uses 64bit integers to represent datetimes:
np.array('20151225', dtype=np.datetime64) # We use an array just so Jupyter will show us the type details
Note the "[D]" after the type. NumPy is flexible in how the 64bits are allocated between date and time components. Because we specified a date only, it assumes the granularity is days, which is what the "D" means. There are a number of other possible units; the most useful are:
Y  Years 

M  Months 
W  Weeks 
D  Days 
h  Hours 
m  Minutes 
s  Seconds 
ms  Milliseconds 
us  Microsecond 
Obviously the finer the granularity the more bits are assigned to fractional seconds leaving less for years so the range dates we can represent shrinks. The values are signed integers; in most cases 0 would be 0AD but for some very fine granularity units 0 is Jan 1, 1970 (e.g. "as" is attoseconds and the range here is less than 10 seconds either side of the start of 1970!).
There is also a default "ns" format suitable for most uses.
When constructing a NumPy datetime the units can be specified explicitly or inferred based on the initialization value's format:
np.array(np.datetime64('20151225 12:00:00.00')) # default to ms as that's the granularity in the datetime
np.array(np.datetime64('20151225 12:00:00.00', 'us')) # use microseconds
NumPy's date parsing is very limited and for the most part we will use Pandas datetime types that we will discuss later.
Pandas¶
NumPy is primarily aimed at scientific computation e.g. linear algebra. As such, 2D data is in the form of arrays of arrays. In data science applications, we are more often dealing with tabular data; that is, collections of records (samples, observations) where each record may be heterogeneous but the schema is consistent from record to record. The Pandas library is built on top of NumPy to provide this type of representation of data, along with the types of operations more typical in data science applications, like indexing, filtering and aggregation. There are two primary classes it provides for this, Series and DataFrame.
Pandas Series¶
A Pandas Series is a onedimensional array of indexed data. It wraps a sequence of values (a NumPy array) and a sequence of indices (a pd.Index
object), along with a name. Pandas indexes can be thought of as immutable dictionaries mapping keys to locations/offsets in the value array; the dictionary implementation is very efficient and there are specialized versions for each type of index (int, float, etc).
For those interested, the underlying implementation used for indexes in Pandas is klib: https://github.com/attractivechaos/klib
squares = pd.Series([1, 4, 9, 16, 25])
print(squares.name)
squares
From the above you can see that by default, a series will have numeric indices assigned, as a sequential list starting from 0, much like a typical Python list or array. The default name for the series is None
, and the type of the data is int64
.
squares.values
squares.index
You can show the first few lines with .head()
. The argument, if omitted, defaults to 5.
squares.head(2)
The data need not be numeric:
data = pd.Series(["quick", "brown", "fox"], name="Fox")
data
Above, we have assigned a name to the series, and note that the data type is now object
. Think of Pandas object
as being strings/text and/or None
rather than generic Python objects; this is the predominant usage.
What if we combine integers and strings?
data = pd.Series([1, "quick", "brown", "fox"], name="Fox")
data
We can have "missing" values using None:
data = pd.Series(["quick", None, "fox"], name="Fox")
data
For a series of type object
, None
can simply be included, but what if the series is numeric?
data = pd.Series([1, None, 3])
data
As you can see, the special float value NaN (np.nan
, for 'not a number') is used in this case. This is also why the series has been changed to have type float64 and not int64; floating point numbers have special reserved values to represent NaN while ints don't.
Be careful with NaN; it will fail equality tests:
np.nan == np.nan
Instead you can use is
or np.isnan()
:
print(np.nan is np.nan)
print(np.isnan(np.nan))
Normal indexing and slicing operations are available, much like Python lists:
squares[2]
squares[2:4]
Where NumPy arrays have implicit integer sequence indices, Pandas indices are explicit and need not be integers:
squares = pd.Series([1, 4, 9, 16, 25],
index=['square of 1', 'square of 2', 'square of 3', 'square of 4', 'square of 5'])
squares
squares['square of 3']
As you can see, a Series is a lot like a Python dict (with additional slicing like a list). In fact, we can construct one from a Python dict:
pd.Series({'square of 1':1, 'square of 2':4, 'square of 3':9, 'square of 4':16, 'square of 5':25})
You can use both a dictionary and an explicit index but be careful if the index and dictionary keys don't align completely; the explicit index takes precedence. Look at what happens:
pd.Series({"one": 1, "three": 3}, index=["one", "two"])
Exercise 1¶
Given the list below, create a Series that has the list as both the index and the values, and then display the first 3 rows:
ex1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']
A number of dictstyle operations work on a Series:
'square of 5' in squares
squares.keys()
squares.items() # Iterable
list(squares.items())
However, unlike with a Python dict, .values
is an array attribute, not a function returning an iterable, so we use .values
, not .values()
:
squares.values
We can add new entries:
squares['square of 6'] = 36
squares
change existing values:
squares['square of 6'] = 1
squares
and delete entries:
del squares['square of 6']
squares
Iteration (.__iter__
) iterates over the values in a Series, while membership testing (.__contains__
) checks the indices. .iteritems()
will iterate over (index, value) tuples, similar to list's .enumerate()
:
for v in squares: # calls .__iter__()
print(v)
print(16 in squares)
print('square of 4' in squares) # calls .__contains__()
print(16 in squares.values)
for v in squares.iteritems():
print(v)
Vectorized Operations¶
You can iterate over a Series or Dataframe, but in many cases there are much more efficient vectorized UFuncs available; these are implemented in native code exploiting parallel processor operations and are much faster. Some examples are .sum()
, .median()
, .mode()
, and .mean()
:
squares.mean()
Series also behaves a lot like a list. We saw some indexing and slicing earlier. This can be done on nonnumeric indexes too, but be careful: it includes the final value:
squares['square of 2': 'square of 4']
If one or both of the keys are invalid, the results will be empty:
squares['square of 2': 'cube of 4']
Exercise 2¶
Delete the row 'k' from the earlier series you created in exercise 1, then display the rows from 'f' through 'l'.
Something to be aware of, is that the index need not be unique:
people = pd.Series(['alice', 'bob', 'carol'], index=['teacher', 'teacher', 'plumber'])
people
If we dereference a Series by a nonunique index we will get a Series, not a scalar!
people['plumber']
people['teacher']
You need to be very careful with nonunique indices. For example, assignment will change all the values for that index without collapsing to a single entry!
people['teacher'] = 'dave'
people
To prevent this you could use positional indexing, but my advice is to try to avoid using nonunique indices if at all possible. You can use the .is_unique
property on the index to check:
people.index.is_unique
DataFrames¶
A DataFrame is like a dictionary where the keys are column names and the values are Series that share the same index and hold the column values. The first "column" is actually the shared Series index (there are some exceptions to this where the index can be multilevel and span more than one column but in most cases it is flat).
names = pd.Series(['Alice', 'Bob', 'Carol'])
phones = pd.Series(['5551234567', '5559876543', '5552456789'])
dept = pd.Series(['Marketing', 'Accounts', 'HR'])
staff = pd.DataFrame({'Name': names, 'Phone': phones, 'Department': dept}) # 'Name', 'Phone', 'Department' are the column names
staff
Note above that the first column with values 0, 1, 2 is actually the shared index, and there are three series keyed off the three names "Department", "Name" and "Phone".
Like Series
, DataFrame
has an index for rows:
staff.index
DataFrame
also has an index for columns:
staff.columns
staff.values
The index operator actually selects a column in the DataFrame, while the .iloc and .loc attributes still select rows (actually, we will see in the next section that they can select a subset of the DataFrame with a row selector and column selector, but the row selector comes first so if you supply a single argument to .loc or .iloc you will select rows):
staff['Name'] # Acts similar to dictionary; returns the Series for a column
staff.loc[2]
You can get a transpose of the DataFrame with the .T attribute:
staff.T
You can also access columns like this, with dotnotation. Occasionally this breaks if there is a conflict with a UFunc name, like 'count':
staff.Name
You can add new columns. Later we'll see how to do this as a function of existing columns:
staff['Fulltime'] = True
staff.head()
Use .describe()
to get summary statistics:
staff.describe()
Use .quantile()
to get quantiles:
df = pd.DataFrame([2, 3, 1, 4, 3, 5, 2, 6, 3])
df.quantile(q=[0.25, 0.75])
Use .drop()
to remove rows. This will return a copy with the modifications and leave the original untouched unless you include the argument inplace=True
.
staff.drop([1])
# Note that because we didn't say inplace=True,
# the original is unchanged
staff
There are many ways to construct a DataFrame. For example, from a Series or dictionary of Series, from a list of Python dicts, or from a 2D NumPy array. There are also utility functions to read data from disk into a DataFrame, e.g. from a .csv file or an Excel spreadsheet. We'll cover some of these later.
Many DataFrame operations take an axis
argument which defaults to zero. This specifies whether we want to apply the operation by rows (axis=0) or by columns (axis=1).
You can drop columns if you specify axis=1
:
staff.drop(["Fulltime"], axis=1)
Another way to remove a column inplace is to use del
:
del staff["Department"]
staff
You can change the index to be some other column. If you want to save the existing index, then first add it as a new column:
staff['Number'] = staff.index
staff
# Now we can set the new index. This is a destructive
# operation that discards the old index, which is
# why we saved it as a new column first.
staff = staff.set_index('Name')
staff
Alternatively you can promote the index to a column and go back to a numeric index with reset_index()
:
staff = df.reset_index()
staff
Exercise 3¶
Create a DataFrame from the dictionary below:
ex3data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
Then:
 Generate a summary of the data
 Calculate the sum of all visits (the total number of visits).
More on Indexing¶
The Pandas Index type can be thought of as an immutable ordered multiset (multiset as indices need not be unique). The immutability makes it safe to share an index between multiple columns of a DataFrame. The setlike properties are useful for things like joins (a join is like an intersection between Indexes). There are dictlike properties (index by label) and listlike properties too (index by location).
Indexes are complicated but understanding them is key to leveraging the power of pandas. Let's look at some example operations to get more familiar with how they work:
# Let's create two Indexes for experimentation
i1 = pd.Index([1, 3, 5, 7, 9])
i2 = pd.Index([2, 3, 5, 7, 11])
You can index like a list with []
:
i1[2]
You can also slice like a list:
i1[2:5]
The normal Python bitwise operators have setlike behavior on indices; this is very useful when comparing two dataframes that have similar indexes:
i1 & i2 # Intersection
i1  i2 # Union
i1 ^ i2 # Difference
Series and DataFrames have an explicit Index but they also have an implicit index like a list. When using the []
operator, the type of the argument will determine which index is used:
s = pd.Series([1, 2], index=["1", "2"])
print(s["1"]) # matches index type; use explicit
print(s[1]) # integer doesn't match index type; use implicit positional
If the explicit Index uses integer values things can get confusing. In such cases it is good to make your intent explicit; there are attributes for this:

.loc
references the explicit Index 
.iloc
references the implicit Index; i.e. a positional index 0, 1, 2,...
The Python way is "explicit is better than implicit" so when indexing/slicing it is better to use these. The example below illustrates the difference:
# Note: explicit index starts at 1; implicit index starts at 0
nums = pd.Series(['first', 'second', 'third', 'fourth'], index=[1, 2, 3, 4])
print(f'Item at explicit index 1 is {nums.loc[1]}')
print(f'Item at implicit index 1 is {nums.iloc[1]}')
print(nums.loc[1:3])
print(nums.iloc[1:3])
When using .iloc
, the expression in []
can be:
 an integer, a list of integers, or a slice object (e.g.
1:7
)  a Boolean array (see Filtering section below for why this is very useful)
 a function with one argument (the calling object) that returns one of the above
Selecting outside of the bounds of the object will raise an IndexError except when using slicing.
When using .loc
, the expression in []
can be:
 an label, a list of labels, or a slice object with labels (e.g.
'a':'f'
; unlike normal slices the stop label is included in the slice)  a Boolean array
 a function with one argument (the calling object) that returns one of the above
You can use one or two dimensions in []
after .loc
or .iloc
depending on whether you want to select a subset of rows, columns, or both.
You can use the set_index
method to change the index of a DataFrame.
If you want to change entries in a DataFrame selectively to some other value, you can use assignment with indexing, such as:
df.loc[row_indexer, column_indexer] = value
Don't use:
df[row_indexer][column_indexer] = value
That chained indexing can result in copies being made which will not have the effect you expect. You want to do all your indexing in one operation. See the details at https://pandas.pydata.org/pandasdocs/stable/indexing.html
Exercise 4¶
Using the same DataFrame from Exercise 3:
 Select just the 'animal' and 'age' columns from the DataFrame
 Select the data in rows [3, 5, 7] and in columns ['animal', 'age']
Loading/Saving CSV, JSON and Excel Files¶
Use Pandas.read_csv
to read a CSV file into a dataframe. There are many optional argumemts that you can provide, for example to set or override column headers, skip initial rows, treat first row as containing column headers, specify the type of columns (Pandas will try to infer these otherwise), skip columns, and so on. The parse_dates
argument is especially useful for specifying which columns have date fields as Pandas doesn't infer these.
Full docs are at https://pandas.pydata.org/pandasdocs/stable/generated/pandas.read_csv.html
crime = pd.read_csv('http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv',
parse_dates=['cdatetime'])
crime.head()
If you need to do some preprocessing of a field during loading you can use the converters
argument which takes a dictionary mapping the field names to functions that transform the field. E.g. if you had a string field zip
and you wanted to take just the first 3 digits, you could use:
..., converters={'zip': lambda x: x[:3]}, ...
If you know what types to expect for the columns, you can (and, IMO, you should) pass a dictionary in with the types
argument that maps field names to NumPy types, to override the type inference. You can see details of NumPy scalar types here: https://docs.scipy.org/doc/numpy1.13.0/reference/arrays.scalars.html. Omit any fields that you may have already included in the parse_dates
argument.
By default the first line is expected to contain the column headers. If it doesn't you can specify them yourself, using arguments such as:
..., header=None, names=['column1name','column2name'], ...
If the separator is not a comma, use the sep
argument; e.g. for a TABseparated file:
..., sep='\t', ...
Use Pandas.read_excel
to load spreadsheet data. Full details here: https://pandas.pydata.org/pandasdocs/stable/generated/pandas.read_excel.html
titanic = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')
titanic.head()
Use the DataFrame.to_csv
method to save a DataFrame to a file or DataFrame.to_excel
to save as a spreadsheet.
It's also possible to read JSON data into a DataFrame. The complexity here is that JSON data is typically hierarchical; in order to turn it into a DataFrame the data typically needs to be flattened in some way. This is controlled by an orient
parameter. For details see https://pandas.pydata.org/pandasdocs/stable/generated/pandas.read_json.html.
Sorting¶
You can sort a DataFrame using the sort_values
method:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, na_position='last')
The by
argument should be a column name or list of column names in priority order (if axis=0, i.e. we are sorting the rows, which is typically the case).
See https://pandas.pydata.org/pandasdocs/stable/generated/pandas.DataFrame.sort_values.html for the details.
Filtering¶
A Boolean expression on a Series will return a Series of Booleans:
titanic.survived == 1
If you index a Series with a Boolean Series, you will select the items where the index is True. For example:
titanic[titanic.survived == 1].head()
You can combine these with &
(and) and 
(or). Pandas uses these bitwise operators because Python allows them to be overloaded while 'and' and 'or' cannot be, and in any event they arguably make sense as they are operating on Boolean series which are similar to bit vectors.
As &
and 
have higher operator precedence than relational operators like >
and ==
, the subexpressions we use with them need to be enclosed in parentheses:
titanic[titanic.survived & (titanic.sex == 'female') & (titanic.age > 50)].head()
NumPy itself also supports such Boolean filtering; for example:
s = np.array([3, 2, 4, 1, 5])
s[s > np.mean(s)] # Get the values above the mean
Handling Missing Data¶
To see if there are missing values, we can use isnull() to get a DataFrame where there are null values:
titanic.isnull().head()
The above will show us the first few rows that had null values. If we want to know which columns may have nulls, we can use:
titanic.isnull().any()
.any()
returns True if any are true; .all()
returns True if all are true.
To drop rows that have missing values, use dropna(); add inplace=True
to do it in place.
titanic.dropna().head()
In this case there are none  noone could both be on a boat and be a recovered body, so at least one of these fields is always NaN.
It may be more useful to be selective. For example, if we want to get the rows in which ticket and cabin are not null:
filter = titanic.notnull()
filter.head()
titanic[filter.ticket & filter.cabin].head()
We can use .count()
to get the number of entries in each column that are not null.
titanic.count()
To replace missing values with values of our choosing, we use .fillna()
. With a single scalar argument it will replace all null entries in the DataFrame with that value. Usually we will want to be more granular and control which columns are affected in what ways. Let's see if there are rows with no fare specified:
titanic[~filter.fare]
We can change the fare to zero by passing a dictionary as the argument rather than a scalar:
titanic.fillna({'fare': 0}, inplace=True)
titanic[~filter.fare]
We could also use a method="ffill"
argument for a forward fill or method="bfill"
argument for a backward fill; these are most useful for time series data. Yet another option is to use the .interpolate()
method to use interpolation for the missing values; that is beyond the scope of this notebook.
Exercise 5¶
Using the previous DataFrame from exercise 3, do the following:
 Select only the rows where the number of visits is greater than or equal to 3
 Select the rows where the age is missing, i.e. is NaN
 Select the rows where the animal is a cat and the age is less than 3
 Select the rows the age is between 2 and 4 (inclusive)
 Change the index to use this list:
idx = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
 Change the age in row 'f' to 1.5.
 Append a new row 'k' to df with your choice of values for each column
 Then delete that row to return the original DataFrame
Concatenation¶
pandas.concat
can be used to concatenate Series and DataFrames:
s1 = pd.Series(['A', 'B', 'C'])
s2 = pd.Series(['D', 'E', 'F'])
df = pd.concat([s1, s2])
df
Note that the Indexes are concatenated too, so if you are using a simple row number index you can end up with duplicate values.
df[2]
If you don't want this behavior use the ignore_index
argument; a new index will be generated:
pd.concat([s1, s2], ignore_index=True)
Alternatively you can use verify_integrity=True
to cause an exception to be raised if the result would have duplicate indices.
pd.concat([s1, s2], verify_integrity=True)
d1 = pd.DataFrame([['A1', 'B1'],['A2', 'B2']], columns=['A', 'B'])
d2 = pd.DataFrame([['C3', 'D3'],['C4', 'D4']], columns=['A', 'B'])
d3 = pd.DataFrame([['B1', 'C1'],['B2', 'C2']], columns=['B', 'C'])
pd.concat([d1, d2])
We can join on other axis too:
pd.concat([d1, d2], axis=1)
pd.concat([d1, d3], axis=1)
If the columns are not completely shared, additional NaN entries will be made:
pd.concat([d1, d3])
We can force concat to only include the columns that are shared with an inner join:
pd.concat([d1, d3], join='inner')
See https://pandas.pydata.org/pandasdocs/stable/generated/pandas.concat.html for more options.
Merging and Joining¶
We have already seen how we can add a new column to a DataFrame when it is a fixed scalar value:
df = pd.DataFrame(['Fred', 'Alice', 'Joe'], columns=['Name'])
df
df['Married'] = False
df
We can also give an array of values provided it has the same length, or we can use a Series keyed on the index if it is not the same length:
df['Phone'] = ['5551234567', '5553210000', '5559998765']
df
df['Department'] = pd.Series({0: 'HR', 2: 'Marketing'})
df
Often we want to join two DataFrames instead. Pandas has a merge
function that supports onetoone, manytoone and manytomany joins. merge will look for matching column names between the inputs and use this as the key:
d1 = pd.DataFrame({'city': ['Seattle', 'Boston', 'New York'], 'population': [704352, 673184, 8537673]})
d2 = pd.DataFrame({'city': ['Boston', 'New York', 'Seattle'], 'area': [48.42, 468.48, 142.5]})
pd.merge(d1, d2)
You can explicitly specify the column to join on; this is equivalent to the above example:
pd.merge(d1, d2, on='city')
If there is more than one column in common, only items where the column values match in all cases will be included. Let's add a common column x
and see what happens:
d10 = pd.DataFrame({'city': ['Seattle', 'Boston', 'New York'],
'x': ['a', 'b', 'c'],
'population': [704352, 673184, 8537673]})
d11 = pd.DataFrame({'city': ['Boston', 'New York', 'Seattle'],
'x': ['a', 'c', 'b'],
'area': [48.42, 468.48, 142.5]})
pd.merge(d10, d11)
You can see that Pandas avoided ambiguous cases by just dropping them.
However, if we specify the column for the join, Pandas will just treat the other common columns (if any) as distinct, and add suffixes to disambiguate the names:
pd.merge(d10, d11, on='city')
If the column names to join on don't match you can specify the names to use explicitly:
d3 = pd.DataFrame({'place': ['Boston', 'New York', 'Seattle'], 'area': [48.42, 468.48, 142.5]})
pd.merge(d1, d3, left_on='city', right_on='place')
# If you want to drop the redundant column:
pd.merge(d1, d3, left_on='city', right_on='place').drop('place', axis=1)
merge
joins on arbitrary columns; if you want to join on the index you can use left_index
and right_index
:
df1 = pd.DataFrame(list('ABC'), columns=['c1'])
df2 = pd.DataFrame(list('DEF'), columns=['c2'])
pd.merge(df1, df2, left_index=True, right_index=True)
Pandas provides a utility method on DataFrame, join
, to do the above:
df1.join(df2)
merge
can take a how
argument that can be inner
(intersection), outer
(union), left
(first augmented by second) or right
(second augmented by first) to control the type of join. inner
joins are the default.
If there are other columns with the same name between the two DataFrames, Pandas will give them unique names by appending _x
to the columns from the first argument and _y
to the columns from the second argument.
It's also possible to use lists of column names for the left_on
and right_on
arguments to join on multiple columns.
For more info on merging see https://pandas.pydata.org/pandasdocs/stable/merging.html
Exploring the Data¶
There are some more useful ways to explore the data in our DataFrame. Let's return to the Titanic data set, but this time we will use the sample dataset that comes with Seaborn, which is a bit different to the one we loaded before:
import seaborn as sns;
titanic = sns.load_dataset('titanic')
titanic.head()
You can use .unique()
to see the full set of distinct values in a series:
titanic.deck.unique()
.value_counts()
will get the counts of the unique values:
titanic.deck.value_counts()
.describe()
will give summary statistics on a DataFrame. We first drop rows with NAs:
titanic.dropna().describe()
Aggregating, Pivot Tables, and Multiindexes¶
There is a common set of operations known as the splitapplycombine pattern:

split the data into groups based on some criteria (this is a
GROUP BY
in SQL, orgroupby
in Pandas)  apply some aggregate function on the groups, such as finding the mean for of some column for each group
 combining the results into a new table (Dataframe)
Let's look at some examples. We can see the survival rates by gender by grouping by gender, and aggegating the survival feature using .mean()
:
titanic.groupby('sex')['survived'].mean()
Similarly it's interesting to see the survival rate by passenger class; we'll still group by gender as well:
titanic.groupby(['sex', 'class'])['survived'].mean()
Because we grouped by two columns, the DataFrame result this time is a hierarchical table; an example of a multiindexed DataFrame (indexed by both 'sex' and 'class'). We're mostly going to ignore those in this notebook  you can read about them here  but it is worth noting that Pandas has an unstack
method that can turn a mutiplyindexed DataFrame back into a conventionallyindexed one. Each call to unstack
will flatten out one level of a multiindex hierarchy (starting at the innermost, by default, although you can control this). There is also a stack
method that does the opposite. Let's repeat the above but unstack the result:
titanic.groupby(['sex', 'class'])['survived'].mean().unstack()
You may recognize the result as a pivot of the hierachical table. Pandas has a convenience method pivot_table
to do all of the above in one go. It can take an aggfunc
argument to specify how to aggregate the results; the default is to find the mean which is just what we want so we can omit it:
titanic.pivot_table('survived', index='sex', columns='class')
We could have pivoted the other way:
titanic.pivot_table('survived', index='class', columns='sex')
If we wanted counts instead, we could use Numpy's sum
function to aggregate:
titanic.pivot_table('survived', index='sex', columns='class', aggfunc='sum')
You can see more about what aggregation functions are available here. Let's break things down further by age group (under 18 or over 18). To do this we will create a new series with the age range of each observation, using the cut
function:
age = pd.cut(titanic['age'], [0, 18, 100]) # Assume noone is over 100
age.head()
Now we can create our pivot table using the age series as one of the indices! Pretty cool!
titanic.pivot_table('survived', index=['sex', age], columns='class')
Applying Functions¶
We saw earlier that we can add new columns to a DataFrame easily. The new column can be a function of an existing column. For example, we could add an 'is_adult' field to the Titanic data:
titanic['is_adult'] = titanic.age >= 18
titanic.head()
That's a simple case; we can do more complex rowbyrow applications of arbitrary functions; here's the same change done differently (this would be much less efficient but may be the only option if the function is complex):
titanic['is_adult'] = titanic.apply(lambda row: row['age'] >= 18, axis=1)
titanic.head()
Exercise 6¶
Use the same DataFrame from exercise 5:
 Calculate the mean age for each different type of animal
 Count the number of each type of animal
 Sort the data first by the values in the 'age' column in decending order, then by the value in the 'visits' column in ascending order.
 In the 'animal' column, change the 'snake' entries to 'python'
 The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False
String Operations¶
Pandas has vectorized string operations that will skip over missing values. Looks look at some examples:
# Let's get the more detailed Titanic data set
titanic3 = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')
titanic3.head()
# Uppercase the home.dest field
titanic3['home.dest'].str.upper().head()
# Let's split the field up into two
place_df = titanic3['home.dest'].str.split('/', expand=True) # Expands the split list into DF columns
place_df.columns = ['home', 'dest', ''] # For some reason there is a third column
titanic3['home'] = place_df['home']
titanic3['dest'] = place_df['dest']
titanic3 = titanic3.drop(['home.dest'], axis=1)
titanic3.head()
Ordinal and Categorical Data¶
So far we have mostly seen numeric, daterelated, and "object" or string data. When loading data, Pandas will try to infer if it is numeric, but fall back to string/object. Loading functions like read_csv
take arguments that can let us explicitly tell Pandas what the type of a column is, or whether it should try to parse the values in a column as dates. However, there are other types that are common where Pandas will need more help.
Categorical data is data where the values fall into a finite set of nonnumeric values. Examples could be month names, department names, or occupations. It's possible to represent these as strings but generally much more space and time efficient to map the values to some more compact underlying representation. Ordinal data is categorical data where the values are also ordered; for example, exam grades like 'A', 'B', 'C', etc, or statements of preference ('Dislike', 'Neutral', 'Like'). In terms of use, the main difference is that it is valid to compare categorical data for equality only, while for ordinal values sorting, or comparing with relational operators like '>', is meaningful (of course in practice we often sort categorical values alphabetically, but that is mostly a convenience and doesn't usually imply relative importance or weight). It's useful to think of ordinal and categorical data as being similar to enumerations in programming languages that support these.
Let's look at some examples. We will use a dataset with automobile data from the UCI Machine Learning Repository. This data has '?' for missing values so we need to specify that to get the right conversion. It's also missing a header line so we need to supply names for the columns:
autos = pd.read_csv("http://mlr.cs.umass.edu/ml/machinelearningdatabases/autos/imports85.data", na_values='?',
header=None, names=[
"symboling", "normalized_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"
])
autos.head()
There are some obvious examples here for categorical types; for example make
, body_style
, drive_wheels
, and engine_location
. There are also some numeric columns that have been represented as words. Let's fix those first. First we should see what possible values they can take:
autos['num_cylinders'].unique()
autos['num_doors'].unique()
Let's fix the nan
values for num_doors
; four seems a reasonable default for the number of doors of a car:
autos = autos.fillna({"num_doors": "four"})
To convert these to numbers we need to way to map from the number name to its value. We can use a dictionary for that:
numbers = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12}
Now we can use the replace
method to transform the values using the dictionary:
autos = autos.replace({"num_doors": numbers, "num_cylinders": numbers})
autos.head()
Now let's return to the categorical columns. We can use astype
to convert the type, and we want to use the type category
:
autos["make"] = autos["make"].astype('category')
autos["fuel_type"] = autos["fuel_type"].astype('category')
autos["aspiration"] = autos["aspiration"].astype('category')
autos["body_style"] = autos["body_style"].astype('category')
autos["drive_wheels"] = autos["drive_wheels"].astype('category')
autos["engine_location"] = autos["engine_location"].astype('category')
autos["engine_type"] = autos["engine_type"].astype('category')
autos["fuel_system"] = autos["fuel_system"].astype('category')
autos.dtypes
Under the hood now each of these columns has been turned into a type similar to an enumeration. We can use the .cat
attribute to access some of the details. For example, to see the numeric value's now associated with each row for the make
column:
autos['make'].cat.codes.head()
autos['make'].cat.categories
autos['fuel_type'].cat.categories
It's possible to change the categories, assign new categories, remove category values, order or reorder the category values, and more; you can see more info at http://pandas.pydata.org/pandasdocs/stable/categorical.html
Having an underlying numerical representation is important; most machine learning algorithms require numeric features and can't deal with strings or categorical symbolic values directly. For ordinal types we can usually just use the numeric encoding we have generated above, but with nonordinal data we need to be careful; we shouldn't be attributing weight to the underlying numeric values. Instead, for nonordinal values, the typical approach is to use onehot encoding  create a new column for each distinct value, and just use 0 or 1 in each of these columns to indicate if the observation is in that category. Let's take a simple example:
wheels = autos[['make', 'drive_wheels']]
wheels.head()
The get_dummies
method will 1hot encode a feature:
onehot = pd.get_dummies(wheels['drive_wheels']).head()
onehot
To merge this into a dataframe with the make
, we can merge with the wheels
dataframe on the implicit index field, and then drop the original categorical column:
wheels.merge(onehot, left_index=True, right_index=True).drop('drive_wheels', axis=1)
Aligned Operations¶
Pandas will align DataFrames on indexes when performing operations. Consider for example two DataFrames, one with number of transactions by day of week, and one with number of customers by day of week, and say we want to know average transactions per customer by date:
transactions = pd.DataFrame([2, 4, 5],
index=['Mon', 'Wed', 'Thu'])
customers = pd.DataFrame([2, 2, 3, 2],
index=['Sat', 'Mon', 'Tue', 'Thu'])
transactions / customers
Notice how pandas aligned on index to produce the result, and used NaN for mismatched entries. We could specify the value to use as operands by using the div
method:
transactions.div(customers, fill_value=0)
Chaining Methods and .pipe()¶
Many operations on Series and DataFrames return modified copies of the Series or Dataframe, unless the inplace=True
argument is included. Even in that case there is usually a copy made and then the reference is just replaced at the end, so using inplace operations generally isn't faster. Because a Series or Dataframe reference is returned, you can chain multiple operations, for example:
df = (pd.read_csv('data.csv')
.rename(columns=str.lower)
.drop('id', axis=1))
This is great for builtin operations, but what about custom operations? The good news is these are possible too, with .pipe()
, which will allow you to specify your own functions to call as part of the operation chain:
def my_operation(df, *args, **kwargs):
# Do something to the df
...
# Return the modified dataframe
return df
# Now we can call this in our chain.
df = (pd.read_csv('data.csv')
.rename(columns=str.lower)
.drop('id', axis=1)
.pipe(my_operation, 'foo', bar=True))
Statistical Significance and Hypothesis Testing¶
In exploring the data, we may come up with hypotheses about relationships between different values. We can get an indication of whether our hypothesis is correct or the relationship is coincidental using tests of statistical significance.
We may have a simple hypothesis, like "All X are Y". For phenomena in the real world, we usually we can't explore all possible X, and so we can't usually prove all X are Y. To prove the opposite, on the other hand, only requires a single counterexample. The wellknown illustration is the black swan: to prove that all swans are white you would have to find every swan that exists (and possibly that has ever existed and may ever exist) and check its color, but to prove not all swans are white you need to find just a single swan that is not white and you can stop there. For these kinds of hypotheses we can often just look at our historical data and try to find a counterexample.
Let's say that the conversion is better in the test set. How do we know that the change caused the improvement, and it wasn't just by chance? One way is to combine the observations from both the test and control set, then take random samples, and see what the probability is of a sample showing a similar improvement in conversion. If the probability of a similar improvement from a random sample is very low, then we can conclude that the improvement from the change is statistically significant.
In practice we may have a large number of observations in the test set and a large number of observations in the control set, and the approach outlined above may be computationally too costly. There are various tests that can give us similar measures at a much lower cost, such as the ttest (when comparing means of populations) or the chisquared test (when comparing categorical data). The details of how these tests work tests and which ones to choose are beyond the scope of this notebook.
The usual approach is to assume the opposite of what we want to prove; this is called the null hypothesis or $H_0$. For our example, the null hypothesis states there is no relationship between our change and conversion on the website. We then calculate the probability that the data supports the null hypothesis rather than just being the result of unrelated variance: this is called the pvalue. In general, a pvalue of less than 0.05 (5%) is taken to mean that the hypothesis is valid, although this has recently become a contentious point. We'll set aside that debate for now and stick with 0.05.
Let's revisit the Titanic data:
import seaborn as sns;
titanic = sns.load_dataset('titanic')
titanic.head()
If we want to see how gender affected survival rates, one way is with crosstabulation:
ct = pd.crosstab(titanic['survived'],titanic['sex'])
ct
There were a lot more men than women on the ship, and it certainly looks like the survival rate for women was better than for men, but is the difference statistically significant? Our hypothesis is that gender affects survivability, and so the null hypothesis is that it doesn't. Let's measure this with a chisquared test:
from scipy import stats
chi2, p, dof, expected = stats.chi2_contingency(ct.values)
p
That's a very small pvalue! So we can be sure gender was an issue.
When running an online experiment, check the pvalue periodically and plot the trend. You want to see the pvalue gradually converging. If instead it is erratic and showing no sign of conversion, that suggests the experiment is not going to be conclusive.
One last comment: it is often said "correlation does not imply causation". We should be more precise and less flippant than this! If X and Y are correlated, there are only four possibilities:
 this was a chance event. We can determine the probability of that and assess if its a reasonable explanation.
 X causes Y or Y causes X
 X and Y are both caused by some unknown factor Z. This is usually what we are referring to when saying "correlation does not imply causation", but there is still causation!
Plotting¶
Pandas includes the ability to do simple plots. For a Series, this typically means plotting the values in the series as the Y values, and then index as the X values; for a DataFrame this would be a multiplot. You can use x
and y
named arguments to select specific columns to plot, and you can use a kind
argument to specify the type of plot.
See https://pandas.pydata.org/pandasdocs/stable/visualization.html for details.
s = pd.Series([2, 3, 1, 5, 3], index=['a', 'b', 'c', 'd', 'e'])
s.plot()
s.plot(kind='bar')
df = pd.DataFrame(
[
[2, 1],
[4, 4],
[1, 2],
[3, 6]
],
index=['a', 'b', 'c', 'd'],
columns=['s1', 's2']
)
df.plot()
df.plot(x='s1', y='s2', kind='scatter')
Charting with Seaborn¶
See the Python Graph Gallery for many examples of different types of charts including the code used to create them. As you learn to use the plotting libraries in many cases the fastest way to get results is just find an example from there and copy/paste/edit it.
There are a number of plotting libraries for Python; the most well known are matplotlib, Seaborn, Bokeh, and Plotly. Some offer more interactivity than others. Matplotlib is the most commonly used; it is very flexible but requires a fair amount of boilerplate code. There is a good tutorial on matplotlib here. We will instead use Seaborn, which is built on top of matplotlib and simplifies its usage so that many plots just take one line of code.
# Let's get the more detailed Titanic data set
titanic3 = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')
titanic3.head()
# We can use a factorplot to count categorical data
import seaborn as sns
sns.factorplot('sex', data=titanic3, kind='count')
# Let's bring class in too:
sns.factorplot('pclass', data=titanic3, hue='sex', kind='count')
# Of course we can aggregate the other way too
sns.factorplot('sex', data=titanic3, hue='pclass', kind='count')
# Let's see how many people were on each deck
deck = pd.DataFrame(titanic3['cabin'].dropna().str[0])
deck.columns = ['deck'] # Get just the deck column
sns.factorplot('deck', data=deck, kind='count')
# What class passenger was on each deck?
df = titanic3[['cabin', 'pclass']].dropna()
df['deck'] = df.apply(lambda row: ord(row.cabin[0]) 64, axis=1)
sns.regplot(x=df["pclass"], y=df["deck"])
Working with Dates and Time Series¶
Pandas provides several classes for dealing with datetimes: Timestamp
, Period
, and Timedelta
, and corresponding index types based off these, namely DatetimeIndex
, PeriodIndex
and TimedeltaIndex
.
For parsing dates we can use pd.to_datetime
which can parse dates in many formats, or pd.to_timedelta
to get a time delta. For formatting dates as strings the Timestamp.strftime
method can be used.
For example, to get a fourweeklong range of dates starting from Christmas 2017:
di = pd.to_datetime("December 25, 2017") + pd.to_timedelta(np.arange(4*7), 'D')
di
It's also possible to pass a list of dates to to_datetime
to create a DatetimeIndex. A DatetimeIndex can be converted to a TimedeltaIndex by subtracting a start date:
di  di[0]
And of course the converse is possible:
(di  di[0]) + di[1]
Another way of creating the indices is to specify range start and ends plus optionally the granularity, via the periods
and freq
arguments, using the APIs pd.date_range
, pd.timedelta_range
, and pd.interval_range
:
pd.date_range('20171230', '20171231')
pd.date_range('20171230', '20171231', freq='h') # Hourly frequency
pd.date_range('20171230', periods=4) # 4 values using the default frequency of day
pd.date_range('20171230', periods=4, freq='h') # 4 values using hourly frequency
Periods represent time intervals locked to timestamps. Consider the difference below:
pd.date_range('201701', '201712', freq='M') # This gives us 12 dates, one for each month, on the last day of each month
pd.period_range('201701', '201712', freq='M') # This gives us 12 month long periods
You may wonder why the dates above were on the last day of each month. Pandas uses frequency codes, as follows:
Code  Meaning 

D  Calendar day 
B  Business day 
W  Weekly 
MS  Month start 
BMS  Business month start 
M  Month end 
BM  Business month end 
QS  Quarter start 
BQS  Business quarter start 
Q  Quarter end 
BQ  Business quarter end 
AS  Year start 
A  Year end 
BAS  Business year start 
BS  Business year end 
T  Minutes 
S  Seconds 
L  Milliseonds 
U  Microseconds 
These can also be combined in some cases; e.g. "!H30T" or "90T" each represent 90 minutes:
pd.date_range('201701', periods=16, freq='1H30T')
We can also add month offsets to annual or quarterly frequencies or day of week constraints to weekly frequencies:
pd.date_range('2017', periods=4, freq='QSFEB') # 4 quarters starting from beginning of February
pd.date_range('201701', periods=4, freq='WMON') # First 4 Mondays in Jan 2017
So what use are all these? To understand that we need some timeseries data. Let's get the eBay daily stock closing price for 2017:
import sys
!conda install yes prefix {sys.prefix} pandasdatareader
from pandas_datareader import data
ebay = data.DataReader('EBAY', start='2017', end='2018', data_source='iex')['close']
ebay.plot()
ebay.head()
ebay.index
Our index is not timestampbased, so let's fix that:
ebay.index = pd.to_datetime(ebay.index)
ebay.index
Let's plot just January prices:
ebay["201701"].plot()
Let's plot weekly closing prices:
ebay[pd.date_range('201701', periods=52, freq='WFRI')].plot()
This is just a small sample of what Pandas can do with time series; Pandas came out of financial computation and has very rich capabilities in this area.
Summarizing Data with pandas_profiling and facets¶
pandas_profiling
is a Python package that can produce much more detailed summaries of data than the .describe()
method. In this case we must install with pip
and the right way to do this from the notebook is:
import sys
!{sys.executable} m pip install pandasprofiling
%matplotlib inline
import pandas_profiling
import seaborn as sns;
titanic = sns.load_dataset('titanic')
pandas_profiling.ProfileReport(titanic) # You may need to run cell twice
Facets is a library from Google that looks very good. It has similar functionality to pandas_profiling as well as some powerful visualization. Installation is more complex so we won't use it now but it is worth considering.
Handling Data that Exceeds Your System's RAM¶
Pandas is an inmemory system. The use of NumPy means it uses memory very efficiently but you are still limited by the RAM you have available. If your data is too large, there are several options available, including:
 process the data sequentially (may not be possible but see here for an interesting approach)
 partition the data into chunks and process those separately
 partition the data into chunks and use multiple computers configured as a cluster with
ipyparallel
(https://ipyparallel.readthedocs.io/en/latest/)  use a DataFramelike library that handles larger datasets, like Dask DataFrames (http://dask.pydata.org/en/latest/dataframe.html)
 use a tool like Apache Drill which can SQL queries against files on disk in formats like CSV
 putting the data in a database and operating on a subset in Pandas using a SELECT statement.
These are all out of scope of this document but we will briefly elaborate on the last two. Python comes standard with an implementation of Sqlite, in the package sqlite3
. Pandas supports reading a DataFrame from the result of running a query against a Sqlite database. Here's a very simple example of how that may look:
import sqlite3 as lite
with lite.connect('mydata.db') as con:
query = 'select * from sales limit 100'
df = pd.read_sql(query, con)
You can read more about Sqlite here: https://sqlite.org/quickstart.html.
Dask supports chunked dataframes that support most of the functionality of Pandas. The key additional parameter is blocksize
which specifies the maximum size of a chunk of data to read into memory at one time. In addition, Dask methods are lazily evaluated; you must explicitly call a .compute()
method to kick off the calculation. Here is a simple example: assume we have multiple CSV files containing temperature measurements. We could compute the mean temperature with something like:
import dask.dataframe as dd
df = dd.read_csv('temp*.csv', blocksize=25e6) # Use 25MB chunks
df.temperature.mean().compute()
Adding Interactivity with ipywidgets¶
ipywidgets
is an extension package for Jupyter that allows output cells to include interactive HTML elements. To install, you will need to run a command to enable the extension from a terminal and then restart Jupyter. First, install the package; the code below shows the right way to do this from within the notebook:
!conda install c condaforge prefix {sys.prefix} yes ipywidgets
Now you need to run this command from your terminal, kill and restart JupyterLab, then return here.
jupyter labextension install @jupyterwidgets/jupyterlabmanager
(You can run it from within JupyterLab but you will still need a restart before the widgets will work).
We will look at a simple example using the interact
function from ipywidgets
. You call this giving it a function as the first argument, followed by zero or more additional arguments that can be tuples, lists or dictionaries. These arguments will each become interactive controls like sliders and dropdowns, and any change in their values will cause the function to be called again with the new values as arguments.
See http://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html for more info on creating other types of controls when using interact
.
from ipywidgets import interact
import pandas as pd
df = pd.DataFrame([[2, 1], [4, 4], [1, 2], [3, 6]], index=['a', 'b', 'c', 'd'], columns=['s1', 's2'])
def plot_graph(kind, col):
what = df if col == 'all' else df[col]
what.plot(kind=kind)
interact(plot_graph, kind=['line', 'bar'], col=['all', 's1', 's2'])
Some Useful Packages and Resources¶

openpyxl
allows you to create and work directly with Excel spreadsheets 
faker
can create fake data like names, addresses, credit card numbers, and social security numbers 
numba
includes a@jit
decorator that can speed up the execution of many functions; useful when crunching data outside of Pandas (it won't speed up Pandas code) 
moviepy
allows you to edit video framebyframe (or even create video)  ray is a new package that lets you leverage your GPU to speed up pandas code
 qgrid is a Jupyter extension that adds interactive sorting, filtering and editing of DataFrames
Video tutorials on Pandas: http://www.dataschool.io/easierdataanalysiswithpandas/
Jake VanderPlas' excellent Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
Tom Augspurger has a great multipart series on Pandas aimed at intermediate to advanced users.
Example: Loading JSON into a DataFrame and Expanding Complex Fields¶
In this example we'll see how we can load some structured data and process it into a flat table form better suited to machine learning.
# Let's get some data; top stories from lobste.rs; populate a DataFrame with the JSON
stories = pd.read_json('https://lobste.rs/hottest.json')
stories.head()
# Use the "short_id' field as the index
stories = stories.set_index('short_id')
# Show the first few rows
stories.head()
# Take a look at the submitter_user field; it is a dictionary itself.
stories.submitter_user[0]
# We want to expand these fields into our dataframe. First expand into its own dataframe.
user_df = stories.submitter_user.apply(pd.Series)
user_df.head()
# We should make sure there are no collisions in column names.
set(user_df.columns).intersection(stories.columns)
# We can rename the column to avoid the clash
user_df = user_df.rename(columns={'created_at': 'user_created_at'})
# Now combine them, dropping the original compound column that we are expanding.
stories = pd.concat([stories.drop(['submitter_user'], axis=1), user_df], axis=1)
stories.head()
# The tags field is another compound field.
stories.tags.head()
# Make a new dataframe with the tag lists expanded into columns of Series.
tag_df = stories.tags.apply(pd.Series)
tag_df.head()
# Pivot the DataFrame
tag_df = tag_df.stack()
tag_df
# Expand into a 1hot encoding
tag_df = pd.get_dummies(tag_df)
tag_df.head()
# Merge multiple rows
tag_df = tag_df.sum(level=0)
tag_df.head()
# And add back to the original dataframe
stories = pd.concat([stories.drop('tags', axis=1), tag_df], axis=1)
stories.head()
Further Reading¶
The definitive Pandas book is the one by Wes McKinney, original author of Pandas. I also recommend Jake Vanderplas's book, and the one by Matt Harrison. The links below are affiliate links where I may earn a small commission: