Pandas

Pandas is a powerful and widely used data manipulation library in Python. It provides data structures for efficiently storing and manipulating large datasets. This tutorial will cover the basic functionalities of Pandas, including creating DataFrames, indexing, cleaning, and exploring data.

Creating DataFrames

A DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns). It is one of the key data structures provided by Pandas and is widely used for data manipulation and analysis. DataFrames can be created in various ways, such as from dictionaries, lists, CSV files, and Excel files. Here, we'll explore some common methods for creating DataFrames.

  • From a dictionary of lists or a list of dictionaries
    import pandas as pd
    
    data = {'Name': ['Alice', 'Bob', 'Charlie'],
            'Age': [25, 30, 22],
            'City': ['New York', 'San Francisco', 'Los Angeles']}
    # or, equivalently, as a list of row dictionaries
    data = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'}, 
            {'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'}, 
            {'Name': 'Charlie', 'Age': 22, 'City': 'Los Angeles'}]
    
    df = pd.DataFrame(data)
    print(df)
    

Either form produces a DataFrame like this:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
  • From CSV files
    df = pd.read_csv('data.csv')
    
  • From Excel files
    df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
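
The file-based readers above can be tried without any file on disk. The following is a minimal sketch, assuming a hypothetical CSV text that stands in for 'data.csv'; it relies on read_csv accepting a file-like object in place of a path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the 'data.csv' file
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,22,Los Angeles
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # Age is inferred as an integer column

# Round trip: write the DataFrame back out as CSV text without the index
csv_out = df.to_csv(index=False)
print(csv_out)
```

Passing index=False to to_csv keeps the row index out of the output, which is usually what you want when the index is just the default 0, 1, 2 numbering.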
    

Exploring data

Exploring data is a crucial step in any data analysis process, and Pandas provides a variety of functions to help you understand and analyze your dataset. Here are some common techniques for exploring data with Pandas:

  • Basic
    import pandas as pd
    
    # Assuming df is your DataFrame
    print(df.head())     # Display the first few rows
    print(df.tail())     # Display the last few rows
    df.info()            # Print a concise summary (info() prints directly and returns None)
    print(df.describe())  # Descriptive statistics
    
  • Indexing and selecting information
    # Selecting a single column
    print(df['Column'])
    
    # Selecting multiple columns
    print(df[['Column1', 'Column2']])
    
    # Selecting a row by label
    print(df.loc[0])
    
    # Selecting a row by integer index
    print(df.iloc[0])
    
    # Selecting rows based on a condition
    print(df[df['Column'] > 25])
    
  • Missing values or duplicates
    # Checking for missing values
    print(df.isnull())
    
    # Dropping rows with any missing values
    df_cleaned = df.dropna()
    
    # Filling missing values with a specific value
    df_filled = df.fillna(0)  # 0 is only an example; pick a fill value suited to your data
    
    # Checking for duplicate rows
    print(df.duplicated())
    
    # Removing duplicate rows
    df_no_duplicates = df.drop_duplicates()
    
  • Creating new columns
    # Creating a new column based on existing columns
    df['NewColumn'] = df['Column1'] + df['Column2']
    
  • Handling Categorical Data
    # Converting a column to categorical
    df['CategoryColumn'] = df['CategoryColumn'].astype('category')
    
    # Getting counts of each category
    print(df['CategoryColumn'].value_counts())
    
  • Grouping
    # Grouping by a column and applying aggregation functions
    grouped_df = df.groupby('CategoryColumn').agg({'NumericColumn': ['mean', 'sum']})
    print(grouped_df)
    
  • Visualization
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Basic plotting
    df.plot(x='Column1', y='Column2', kind='scatter')
    plt.show()
    
    # Using Seaborn for more advanced plots
    sns.boxplot(x='CategoryColumn', y='NumericColumn', data=df)
    plt.show()
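
The snippets above assume an existing df. As a self-contained illustration, the sketch below builds a small hypothetical DataFrame (the names, ages, and cities are made up for the example) and runs the missing-value, duplicate, selection, and grouping steps end to end.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing Age and one fully duplicated row
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Dana'],
    'Age': [25, 30, np.nan, 30, 41],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'San Francisco', 'New York'],
})

missing_per_column = df.isnull().sum()   # count of missing values per column
duplicate_count = df.duplicated().sum()  # rows identical to an earlier row

# Drop the duplicate first, then fill the remaining gap with the median age
df_clean = df.drop_duplicates().copy()
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Boolean mask selection, then a per-city aggregation
over_25 = df_clean[df_clean['Age'] > 25]
by_city = df_clean.groupby('City')['Age'].agg(['mean', 'count'])

print(missing_per_column)
print(by_city)
```

Summing the result of isnull() or duplicated() is often more useful than printing the raw boolean output, because it reduces a full table of True/False values to one count per column or per DataFrame.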
    

Data Cleaning

Data cleaning is an essential step in data preparation, ensuring that the dataset is accurate, consistent, and suitable for analysis. Pandas provides various functions to facilitate data cleaning tasks. Here are some common techniques for cleaning data with Pandas:

  • Handling outliers
    # quantile() skips missing values, whereas np.percentile returns NaN if any are present
    Q1 = df['NumericColumn'].quantile(0.25)
    Q3 = df['NumericColumn'].quantile(0.75)
    IQR = Q3 - Q1
    
    # Filtering data within 1.5 times the interquartile range
    df_no_outliers = df[(df['NumericColumn'] >= Q1 - 1.5 * IQR) & (df['NumericColumn'] <= Q3 + 1.5 * IQR)]
    
  • Correcting Data Types
    df['NumericColumn'] = pd.to_numeric(df['NumericColumn'])  # Convert column to numeric
    df['DateColumn'] = pd.to_datetime(df['DateColumn'])      # Convert column to datetime
    df['CategoryColumn'] = df['CategoryColumn'].astype('category')  # Convert column to category
    
  • Text data cleaning
    df['TextColumn'] = df['TextColumn'].str.lower()      # Convert text to lowercase
    df['TextColumn'] = df['TextColumn'].str.strip()      # Remove leading and trailing whitespaces
    df['TextColumn'] = df['TextColumn'].str.replace('[^a-zA-Z ]', '', regex=True)  # Remove non-alphabetic characters (regex=True is required in pandas 2.0+)
    
  • Renaming columns
    df.rename(columns={'OldName': 'NewName'}, inplace=True)  # Rename a specific column
    df.columns = ['NewColumn1', 'NewColumn2']  # Rename all columns (the list length must equal the number of columns)
    
  • Binning (discretization)

    bins = [0, 25, 50, 75, 100]
    labels = ['0-25', '26-50', '51-75', '76-100']
    
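    # include_lowest=True puts a value of exactly 0 in the first bin; by default the left edge is excluded
    df['BinnedColumn'] = pd.cut(df['NumericColumn'], bins=bins, labels=labels, include_lowest=True)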
    

  • Dealing with date and time

    df['DateColumn'] = pd.to_datetime(df['DateColumn'])  # Convert column to datetime
    df['DayOfWeek'] = df['DateColumn'].dt.day_name()  # Extract day of the week
    df['Month'] = df['DateColumn'].dt.month  # Extract month
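
To see several of these cleaning steps working together, here is a minimal sketch on hypothetical data (the Score and Label columns and their values are invented for illustration). It uses errors='coerce' with to_numeric, which is an addition to the snippets above, so that unparseable text becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, untidy labels, one extreme value
df = pd.DataFrame({
    'Score': ['10', '12', '11', '13', '12', '95', 'n/a'],
    'Label': ['  Apple! ', 'banana#1', 'CHERRY', 'apple', ' Banana ', 'cherry?', 'apple'],
})

# errors='coerce' turns unparseable entries such as 'n/a' into NaN
df['Score'] = pd.to_numeric(df['Score'], errors='coerce')

# Lowercase, drop everything except letters and spaces, then trim whitespace
df['Label'] = (
    df['Label']
    .str.lower()
    .str.replace(r'[^a-z ]', '', regex=True)
    .str.strip()
)

# IQR rule: keep rows within 1.5 * IQR of the quartiles (NaN rows drop out too)
q1 = df['Score'].quantile(0.25)
q3 = df['Score'].quantile(0.75)
iqr = q3 - q1
in_range = df['Score'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[in_range].copy()

# Bin the remaining scores into labelled intervals
df_no_outliers['ScoreBand'] = pd.cut(
    df_no_outliers['Score'],
    bins=[0, 11, 100],
    labels=['low', 'high'],
    include_lowest=True,
)
print(df_no_outliers)
```

The order matters here: converting to numeric first lets the IQR filter work on real numbers, and filtering before binning keeps the extreme value from ending up in a bin of its own.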
    

Data Manipulation

Data manipulation involves changing existing data, such as adding or removing columns, updating values, and creating new features. Here are some common data manipulation tasks with Pandas:

  • Adding or updating a column
    import pandas as pd
    
    # Assuming df is your DataFrame
    # Adding a column: the list length must match the number of rows in df
    df['NewColumn'] = [1, 2, 3, 4, 5]
    
    # Updating an existing column
    df['ExistingColumn'] = df['ExistingColumn'] * 2
    
  • Dropping columns in place
    # Removing a single column
    df.drop('ColumnToRemove', axis=1, inplace=True)
    
    # Removing multiple columns
    columns_to_remove = ['Column1', 'Column2']
    df.drop(columns=columns_to_remove, inplace=True)
    
  • Applying functions
    # Applying a function to each element of a column
    df['NumericColumn'] = df['NumericColumn'].apply(lambda x: x * 2)
    
    # Applying a function to each row
    df['NewColumn'] = df.apply(lambda row: row['Column1'] + row['Column2'], axis=1)
    
  • Mapping values
    # Creating a new column based on a mapping
    gender_mapping = {'M': 'Male', 'F': 'Female'}
    df['Gender'] = df['Code'].map(gender_mapping)
    
  • Change data type
    # Converting a column to numeric
    df['NumericColumn'] = pd.to_numeric(df['NumericColumn'])
    
    # Converting a column to datetime
    df['DateColumn'] = pd.to_datetime(df['DateColumn'])
    
    # Converting a column to category
    df['CategoryColumn'] = df['CategoryColumn'].astype('category')
    
  • Sorting
    # Sorting based on one or more columns
    df.sort_values(by=['Column1', 'Column2'], ascending=[True, False], inplace=True)
    
  • Combining data frames
    # Concatenating DataFrames (df1 and df2 are assumed to exist)
    # axis=0 stacks rows; axis=1 places the DataFrames side by side as columns
    df_concatenated = pd.concat([df1, df2], axis=0)
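
The manipulation steps above can be chained on real objects. The sketch below is one possible combination, assuming two small hypothetical DataFrames (df1 and df2, with made-up Code and Score columns); it returns new DataFrames rather than using inplace=True, which is a stylistic choice and not a requirement.

```python
import pandas as pd

# Two hypothetical DataFrames with the same columns
df1 = pd.DataFrame({'Code': ['M', 'F'], 'Score': [70, 85]})
df2 = pd.DataFrame({'Code': ['F', 'X'], 'Score': [90, 60]})

# Stack the rows; ignore_index=True rebuilds a clean 0..n-1 index
df = pd.concat([df1, df2], axis=0, ignore_index=True)

# map() returns NaN for keys missing from the mapping (here 'X')
gender_mapping = {'M': 'Male', 'F': 'Female'}
df['Gender'] = df['Code'].map(gender_mapping)

# Vectorized arithmetic is usually preferable to apply() for simple expressions
df['DoubleScore'] = df['Score'] * 2

# sort_values returns a new DataFrame unless inplace=True is passed
df_sorted = df.sort_values(by='Score', ascending=False).reset_index(drop=True)

# Drop a helper column without modifying df itself
df_final = df_sorted.drop(columns=['Code'])
print(df_final)
```

Without ignore_index=True, the concatenated result would keep the original labels 0, 1, 0, 1, and label-based lookups such as df.loc[0] would then return two rows.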
    

Conclusion

Pandas is an invaluable tool for anyone working with tabular data in Python. It provides a flexible and expressive framework for data manipulation, making it easier to clean, analyze, and visualize datasets. By mastering Pandas, you empower yourself to tackle a wide range of data-related tasks efficiently.
