Python Pandas Library Cheat Sheet for a Beginner
This is a python pandas cheat sheet for a quick lookup and review. For more information and explanation, please visit pandas doc: https://pandas.pydata.org/docs/reference/frame.html
Loading data:
Import: Import pandas as pd
Read CSV: pd.read_csv(‘file_name.csv’)
Union all multiple dataframes: pd.concat([df1, df2, df3, …])
Merge:
pd.merge(table1, table2)
table1.merge(table2)
Using left_on or right_on: pd.merge(table1, table2, left_on: table1_column, right_on: table2_column)
Useful option: suffixes = [table1_suffix, table2_suffix]
Join types: pd.merge(table1, table2, how=’join_type’)
Join types = [‘left’, ‘right’, ‘outer’]
Default join type = ‘inner’
Inspecting data:
Inspection:
dataframe.columns
dataframe.head()
dataframe.describe()
dataframe.dtypes
dataframe.isnull().sum() << shows you the # of rows per column with null value
Duplicates:
Finding duplicates: dataframe.duplicated()
Removing duplicates: dataframe.drop_duplicates()
Getting the column information using aggregate:
Available: sum(), mean(), median(), max(), min(), std(), nunique(), unique()
dataframe.column_name.aggregate_function(): ex) dataframe.column_name.sum()
Group by:
One groupby: dataframe.groupby(‘column1’).column2.count()
Multiple groupby: dataframe.groupby([‘column1’, ‘column2’]).column3.count()
Selecting columns:
One column:
Dataframe[‘column1’]
If the column name is one word: dataframe.column1
Multiple columns: dataframe[[‘column1’,’column2’,’column3’]]
Looking up:
Filtering dataframe using loc or iloc:
loc: using label. EX) dataframe.loc[]
iloc: using index. EX) dataframe.iloc[]
Getting the result for specific values:
One: dataframe[dataframe.column_name == ‘specific value]
You can use other logical operations like: <, >, !=
Multiple: use format of:
And: () & ()
Or: () | ()
Cleaning and Tidying data:
Column name change: dataframe.rename(columns = {‘old_column_name’:’new_column_name’, ‘old_column_name2’:’new_column_name2’})
Create a new column:
New_column_value can be one value: ex) dataframe[‘new_column_name’] = ‘Yes’
New_column_value can be a simple calc: ex) dataframe[‘new_column_name’] = dataframe[column1] * dataframe[column2]
New_column_value can be more complex calc using lambda: ex) dataframe[‘new_column_name’] = dataframe.apply( lambda x: x[‘column1’]**2 if x[‘column2’] > (x[‘column1’]*2) else x[‘column2’]*3)
Dealing with missing data:
Drop them: dataframe.dropna()
Fill them:
One column: dataframe.column.fillna(value)
Multiple columns: dataframe.fillna(value = {column1: value1, column2:value2})
Pivot:
dataframe.pivot(index = , columns = , values = ).reset_index()
pd.melt(dataframe, id_vars = , var_name = , value_name = )
Export:
Export:
To csv: dataframe.to_csv()
To json: dataframe.to_json()
Comments