Overview of Pandas Data Frames

While collections are typically the group of objects or tuples or simple strings, we need to parse them to further process the data. This process is tedious at times.

  • With Data Frames we can define the structure.

  • Data Frame is nothing but group of rows where each row have multiple attributes with names.

  • Data Frame is similar to a Database Table or Spreadsheet with Header.

  • Pandas provide rich and simple functions to convert data in files into Data Frames and process them

  • Data can be read from files into Data Frame using functions such as read_csv.

  • We can perform all standard operations on Data Frames.

    • Projection or Selection

    • Filtering

    • Aggregations

    • Joins

    • Sorting


Let us perform few tasks to recap the usage of Pandas Data Frames.

  • Read order items data from the location on your system. In mine it is /data/retail_db/order_items/part-00000. Use the information below to define schema.

  • It has 6 fields with the below names in the same order as specified below.

    • order_item_id

    • order_item_order_id

    • order_item_product_id

    • order_item_quantity

    • order_item_subtotal

    • order_item_product_price

import pandas as pd
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = pd. \
             names=["order_item_id", "order_item_order_id",
                    "order_item_product_id", "order_item_quantity",
                    "order_item_subtotal", "order_item_product_price"
  • Project order_item_order_id and order_item_subtotal

order_items[["order_item_id", "order_item_subtotal"]]
  • Filter for order_item_order_id 2

order_items.query("order_item_order_id == 2")
  • Compute revenue for order_item_order_id 2

order_items. \
    query("order_item_order_id == 2")["order_item_subtotal"]. \
  • Get number of items and revenue for each order id. Give alias to the order revenue as revenue.

order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    agg(['sum', 'count']). \
    rename(columns={'sum': 'revenue'})