10 Minutes to Pandas
If you are in data science, there are high chances of using pandas in your Data Science and Machine Learning processes and data pipelines. Considering the need to refer to syntax and the basics of pandas, here is a quick 10-minute intro to pandas and the most used methods from it.
Note: In this article, “pd” is an alias for pandas and “np” is an alias for numpy.
Object Creation
Creating a Series by passing a list of values, letting pandas create a default integer index:
series = pd.Series([1,3,5,np.nan,6,8])
series
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
test_df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
test_df
Date | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.165045 | 0.286237 | -0.388395 | 0.189089 |
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 0.122847 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 |
2013-01-04 | -1.390268 | -0.862023 | 1.708512 | -1.268239 |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 |
2013-01-06 | 1.223380 | -0.034152 | 0.349011 | -0.225668 |
Viewing Data
Here is how to view the top and bottom rows of the frame.
df.head()
df.tail(3)
Date | A | B | C | D |
---|---|---|---|---|
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 |
2013-01-04 | -1.390268 | -0.862023 | 1.708512 | -1.268239 |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 |
Display the index, columns, and the underlying NumPy data:
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[-0.16504516, 0.28623677, -0.38839496, 0.1890891 ],
[-0.38010769, 0.78173448, -0.66866431, 0.12284665],
[ 1.98212925, 1.9705729 , 1.72495074, -0.81086545],
[-1.39026802, -0.86202321, 1.70851228, -1.26823932],
[ 1.0072233 , 0.02410772, 0.53941737, 1.44239551],
[ 1.22337986, -0.03415161, 0.34901142, -0.22566768]])
describe() shows a quick statistic summary of your data
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.379552 | 0.361080 | 0.544139 | -0.091740 |
std | 1.239371 | 0.952760 | 1.012787 | 0.937839 |
min | -1.390268 | -0.862023 | -0.668664 | -1.268239 |
25% | -0.326342 | -0.019587 | -0.204043 | -0.664566 |
50% | 0.421089 | 0.155172 | 0.444214 | -0.051411 |
75% | 1.169341 | 0.657860 | 1.416239 | 0.172528 |
max | 1.982129 | 1.970573 | 1.724951 | 1.442396 |
Transposing your data
df.T
Sorting by an axis
df.sort_index(axis=1, ascending=False)
Date | A | B | C | D |
---|---|---|---|---|
2013-01-01 | 0.189089 | -0.388395 | 0.286237 | -0.165045 |
2013-01-02 | 0.122847 | -0.668664 | 0.781734 | -0.380108 |
2013-01-03 | -0.810865 | 1.724951 | 1.970573 | 1.982129 |
2013-01-04 | -1.268239 | 1.708512 | -0.862023 | -1.390268 |
2013-01-05 | 1.442396 | 0.539417 | 0.024108 | 1.007223 |
2013-01-06 | -0.225668 | 0.349011 | -0.034152 | 1.223380 |
Sorting by values
df.sort_values(by='B')
Date | A | B | C | D |
---|---|---|---|---|
2013-01-04 | -1.390268 | -0.862023 | 1.708512 | -1.268239 |
2013-01-06 | 1.223380 | -0.034152 | 0.349011 | -0.225668 |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 |
2013-01-01 | -0.165045 | 0.286237 | -0.388395 | 0.189089 |
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 0.122847 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 |
Selection
While Standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, it is recommended to use the optimized pandas data access methods such as .at, .iat, .loc, etc..
Selecting a single column, which yields a Series, equivalent to df.A.
df['A']
2013-01-01 -0.165045
2013-01-02 -0.380108
2013-01-03 1.982129
2013-01-04 -1.390268
2013-01-05 1.007223
2013-01-06 1.223380
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows.
df[0:3]
Date | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.165045 | 0.286237 | -0.388395 | 0.189089 |
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 0.122847 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 |
Selection by Label for getting a cross section using a label
df.loc[dates[0]]
A -0.165045
B 0.286237
C -0.388395
D 0.189089
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
df.loc[:,['A','B']]
Date | A | B |
---|---|---|
2013-01-01 | -0.165045 | 0.286237 |
2013-01-02 | -0.380108 | 0.781734 |
2013-01-03 | 1.982129 | 1.970573 |
2013-01-04 | -1.390268 | -0.862023 |
2013-01-05 | 1.007223 | 0.024108 |
2013-01-06 | 1.223380 | -0.034152 |
Showing label slicing, both endpoints are included:
df.loc['20130102':'20130104',['A','B']]
Date | A | B |
---|---|---|
2013-01-02 | -0.380108 | 0.781734 |
2013-01-03 | 1.982129 | 1.970573 |
2013-01-04 | -1.390268 | -0.862023 |
Reduction in the dimensions of the returned object
df.loc['20130102',['A','B']]
A -0.380108
B 0.781734
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
df.loc[dates[0],'A']
For getting fast access to a scalar (equivalent to the prior method):
df.at[dates[0],'A']
Selection by Position
Select via the position of the passed integers:
df.iloc[3]
A -1.390268
B -0.862023
C 1.708512
D -1.268239
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similar to numpy/python:
df.iloc[3:5,0:2]
Date | A | B |
---|---|---|
2013-01-04 | -1.390268 | -0.862023 |
2013-01-05 | 1.007223 | 0.024108 |
By lists of integer position locations, similar to the numpy/python style:
df.iloc[[1,2,4],[0,2]]
Date | A | C |
---|---|---|
2013-01-02 | -0.380108 | -0.668664 |
2013-01-03 | 1.982129 | 1.724951 |
2013-01-05 | 1.007223 | 0.539417 |
For slicing rows explicitly:
df.iloc[1:3,:]
Date | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 0.122847 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 |
For slicing columns explicitly:
df.iloc[:,1:3]
Date | A | B |
---|---|---|
2013-01-01 | 0.286237 | -0.388395 |
2013-01-02 | 0.781734 | - 0.668664 |
2013-01-03 | 1.970573 | 1.724951 |
2013-01-04 | -0.862023 | 1.708512 |
2013-01-05 | 0.024108 | 0.539417 |
2013-01-06 | -0.034152 | 0.349011 |
Boolean Indexing
Using a single column’s values to select data.
df[df.A > 0]
Date | A | B | C | D |
---|---|---|---|---|
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 |
2013-01-06 | 1.223380 | -0.034152 | 0.349011 | -0.225668 |
Selecting values from a DataFrame where a Boolean condition is met.
df[df > 0]
Date | A | B | C | D |
---|---|---|---|---|
2013-01-01 | NaN | 0.286237 | NaN | 0.189089 |
2013-01-02 | NaN | 0.781734 | NaN | 0.122847 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | NaN |
2013-01-04 | NaN | NaN | 1.708512 | NaN |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 |
2013-01-06 | 1.223380 | NaN | 0.349011 | NaN |
Using the isin() method for filtering:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2
Date | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | -0.165045 | 0.286237 | -0.388395 | 0.189089 | one |
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 0.122847 | one |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 | two |
2013-01-04 | -1.390268 | -0.862023 | 1.708512 | -1.268239 | three |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 | four |
2013-01-06 | 1.223380 | -0.034152 | 0.349011 | -0.225668 | three |
df2[df2['E'].isin(['two','four'])]
Date | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | -0.810865 | two |
2013-01-05 | 1.007223 | 0.024108 | 0.539417 | 1.442396 | four |
Setting a new column automatically aligns the data by the indexes.
new_series = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
new_series
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
Setting values by label:
df.at[dates[0],'A'] = 0
Setting by assigning with a NumPy array:
df.loc[:,'D'] = np.array([5] * len(df))
Missing Data
Pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
df = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df.loc[dates[0]:dates[1],'E'] = 1
df
Date | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.388395 | 5 | NaN | 1.0 |
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 5 | 1.0 | 1.0 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | 5 | 2.0 | NaN |
2013-01-04 | -1.390268 | -0.862023 | 1.708512 | 5 | 3.0 | NaN |
To drop any rows that have missing data.
df.dropna(how='any')
Date | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 5 | 1.0 | 1.0 |
Filling missing data.
df.fillna(value=5)
Date | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.388395 | 5 | 5.0 | 1.0 |
2013-01-02 | -0.380108 | 0.781734 | -0.668664 | 5 | 1.0 | 1.0 |
2013-01-03 | 1.982129 | 1.970573 | 1.724951 | 5 | 2.0 | 5.0 |
2013-01-04 | -1.390268 | -0.862023 | 1.708512 | 5 | 3.0 | 5.0 |
To get the boolean mask where values are nan.
pd.isna(df)
Date | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | False |
2013-01-02 | False | False | False | False | False | False |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
Apply
Applying functions to the data:
df.apply(np.cumsum)
Date | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.388395 | 5 | NaN |
2013-01-02 | -0.380108 | 0.781734 | -1.057059 | 10 | 1.0 |
2013-01-03 | 1.602022 | 2.752307 | 0.667891 | 15 | 3.0 |
df.apply(lambda x: x.max() - x.min())
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array. Note that pattern-matching in str generally uses regular expressions by default.
str_series = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
str_series
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Merge
Pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations. Concatenating pandas objects together with concat():
df = pd.DataFrame(np.random.randn(10, 4))
break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
A | B | C | D |
---|---|---|---|
-0.106234 | -0.950631 | 1.519573 | 0.097218 |
1.796956 | -0.450472 | -1.315292 | -1.099288 |
1.589803 | 0.774019 | 0.009430 | -0.227336 |
1.153811 | 0.272446 | 1.984570 | -0.039846 |
0.495798 | 0.714185 | -1.035842 | 0.101935 |
0.254143 | 0.359573 | -1.274558 | -1.978555 |
0.456850 | -0.094249 | 0.665324 | 0.226110 |
-0.657296 | 0.760446 | -0.521526 | 0.392031 |
0.186656 | -0.131740 | -1.404915 | 0.501818 |
-0.523582 | -0.876016 | -0.004513 | -0.509841 |
JOIN
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
key | lval |
---|---|
foo | 1 |
foo | 2 |
key | rval |
---|---|
foo | 4 |
foo | 5 |
pd.merge(left, right, on='key')
key | lval | rval |
---|---|---|
foo | 1 | 4 |
foo | 1 | 5 |
foo | 2 | 4 |
foo | 2 | 5 |
Append
Append rows to a dataframe.
append_df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
append_df
A | B | C | D |
---|---|---|---|
0.310213 | 0.511346 | 1.891497 | 0.491886 |
-2.099571 | -0.477107 | 0.701392 | 0.452229 |
-1.508507 | 0.207553 | 0.140408 | 0.033682 |
-1.026017 | -1.277501 | 1.755467 | 1.056045 |
-0.890034 | 0.726291 | -0.419684 | -1.073366 |
-0.614249 | 1.139664 | -1.582946 | 0.661833 |
-0.010116 | 1.877924 | -0.015332 | 1.176713 |
-0.314318 | 1.088290 | -0.067972 | -1.759359 |
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
A | B | C | D |
---|---|---|---|
foo | one | -0.606619 | 0.295979 |
bar | one | -0.015111 | -1.662742 |
foo | two | -0.212922 | 1.564823 |
bar | three | 0.332831 | 0.337342 |
foo | two | 0.235074 | -0.568002 |
bar | two | -0.892237 | 0.944328 |
foo | one | 0.558490 | 0.977741 |
foo | three | 0.517773 | 1.052036 |
Grouping and then applying the sum() function to the resulting groups
df.groupby('A').sum()
C | D | |
---|---|---|
A | ||
bar | -0.574517 | -0.381072 |
foo | 0.491797 | 3.322576 |
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
df.groupby(['A','B']).sum()
Reshaping
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df_ind = df[:4]
df_ind
The stack() method “compresses” a level in the DataFrame’s columns.
stacked = df_ind.stack()
stacked
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
stacked.unstack(1)
stacked.unstack(0)
Pivot tables
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4, 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12), 'E' : np.random.randn(12)})
We can produce pivot tables from this data very easily
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Getting Data In/Out
Writing to a csv file.
df.to_csv('foo.csv')
Reading a CSV
pd.read_csv('foo.csv')
Writing to a HDF5 Store.
df.to_hdf('foo.h5','df')
Reading from a HDF5 Store.
pd.read_hdf('foo.h5','df')
Writing to an excel file.
df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Pandas is indeed a powerful package to work with, especially for data engineers, scientists who work on manipulating and analysing data. With a solid grasp of Pandas, you are well-equipped to streamline your data workflow and uncover valuable insights from your data.