Overview of Collections

Let’s quickly recap Collections and Tuples in Python. We will primarily talk about the built-in collection types that come with Python: list, set, dict and tuple.

  • Group of elements with length and index - list

  • Group of unique elements - set

  • Group of key value pairs - dict

  • By convention, list and set hold a group of homogeneous elements, while dict and tuple hold a group of heterogeneous elements. Python itself does not enforce this.

  • A list or set is analogous to a database table, while a dict or tuple is analogous to an individual record.

  • Typically we create a list of tuples or dicts, or a set of tuples. A set cannot hold dicts, because dicts are not hashable. Also, a dict can be viewed as a collection of key-value pairs. A pair is nothing but a tuple with 2 elements.

  • list and dict are used far more extensively than set and tuple.

  • We typically use the Map Reduce APIs (map, filter and functools.reduce) to process the data in collections. There are also some pre-defined functions such as len, sum, min and max for aggregating data in collections.
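
As a minimal sketch of the points above, the four collection types and the aggregation functions can be exercised on a few made-up values. None of the data below comes from the retail_db files; the variable names are only illustrative.

```python
from functools import reduce

# Hypothetical sample values, used only for illustration
order_ids = [3, 1, 2, 2]                      # list: ordered, indexed, allows duplicates
unique_order_ids = set(order_ids)             # set: keeps unique elements only
order = {"order_id": 1, "status": "CLOSED"}   # dict: key-value pairs
pair = ("order_id", 1)                        # tuple: a pair is a tuple with 2 elements

# A dict can be viewed as, and rebuilt from, a collection of pairs
pairs = list(order.items())
rebuilt = dict(pairs)

# Pre-defined aggregate functions
summary = (len(order_ids), sum(order_ids), min(order_ids), max(order_ids))

# Map Reduce style: keep the even ids, square them, then add them up
evens = filter(lambda i: i % 2 == 0, order_ids)
squares = map(lambda i: i * i, evens)
total = reduce(lambda x, y: x + y, squares, 0)
```

Here summary evaluates to (4, 8, 1, 3) and total to 8. Note that map and filter return lazy iterators in Python 3, so each can be consumed only once.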

Tasks

Let us perform a few tasks to quickly recap Collections and Tuples in Python. We will also quickly recap the Map Reduce APIs.

  • Create a collection of orders by reading data from a file.

%%sh

ls -ltr /data/retail_db/orders/part-00000

orders_path = "/data/retail_db/orders/part-00000"
# The with block closes the file as soon as the read completes
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()
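
The read above depends on /data/retail_db/orders/part-00000 being present. To try the same pattern elsewhere, a sketch like the following writes a few made-up rows to a temporary file first. The row values, the order_id first field and the helper name read_lines are assumptions for illustration, not part of the dataset.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file and return its lines without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()


# Made-up rows in an assumed order_id,order_date,order_customer_id,order_status layout
sample_rows = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sample rows, then read them back the same way orders is read
    sample_path = os.path.join(tmp_dir, "part-00000")
    with open(sample_path, "w") as f:
        f.write("\n".join(sample_rows) + "\n")
    sample_orders = read_lines(sample_path)
```

After this runs, sample_orders is a list of 3 strings, one per line of the file.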
  • Get all unique order statuses. Make sure data is sorted in alphabetical order.

sorted(set(map(lambda o: o.split(",")[3], orders)))
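
To see what this expression does without the data file, here is a small sketch on made-up rows. The sample values are assumptions, and the set comprehension is shown only as an equivalent spelling of set(map(...)).

```python
# Made-up rows; the status is the fourth comma-separated field (index 3)
sample_orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
    "4,2013-07-26 00:00:00.0,8827,CLOSED",
]

# map extracts the status, set removes duplicates, sorted orders alphabetically
statuses_map = sorted(set(map(lambda o: o.split(",")[3], sample_orders)))

# Equivalent set comprehension
statuses = sorted({o.split(",")[3] for o in sample_orders})
```

Both variables hold ['CLOSED', 'COMPLETE', 'PENDING_PAYMENT']. sorted is applied last because a set has no defined order.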
  • Get count of all unique dates.

len(set(map(lambda o: o.split(",")[1], orders)))
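
Because the same date can appear on many orders, counting distinct dates is different from counting all date values. A short sketch on made-up rows (the sample values are assumptions) shows the gap:

```python
# Made-up rows; the date is the second comma-separated field (index 1)
sample_orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
    "4,2013-07-26 00:00:00.0,8827,CLOSED",
]

# A list keeps duplicates, so this counts one date per order
all_date_count = len(list(map(lambda o: o.split(",")[1], sample_orders)))

# A set drops duplicates, so this counts distinct dates
unique_date_count = len(set(map(lambda o: o.split(",")[1], sample_orders)))
```

Here all_date_count is 4 while unique_date_count is 2.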
  • Sort the data in orders in ascending order by order_customer_id and then order_date.

sorted(orders, key=lambda k: (int(k.split(",")[2]), k.split(",")[1]))
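
The key function returns a tuple, and tuples compare element by element, which is what gives the "customer id first, then date" ordering. The sketch below uses made-up rows and a hypothetical helper named customer_then_date that splits each row only once.

```python
# Made-up rows: order_id, order_date, order_customer_id, order_status
sample_orders = [
    "1,2013-07-26 00:00:00.0,256,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,9,COMPLETE",
]


def customer_then_date(order):
    """Build the composite sort key (customer id as int, date as string)."""
    fields = order.split(",")
    return (int(fields[2]), fields[1])


sorted_orders = sorted(sample_orders, key=customer_then_date)
sorted_ids = [o.split(",")[0] for o in sorted_orders]
```

sorted_ids comes out as ['3', '2', '1']: customer 9 sorts before customer 256, and the two orders for customer 256 are ordered by date. The int conversion matters because, compared as strings, "256" would sort before "9".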
  • Create a collection of order_items by reading data from a file.

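order_items_path = "/data/retail_db/order_items/part-00000"
# The with block closes the file as soon as the read completes
with open(order_items_path) as order_items_file:
    order_items = order_items_file.read().splitlines()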
  • Get revenue for a given order_item_order_id.

def get_order_revenue(order_items, order_id):
    # Keep only the items that belong to the requested order
    order_items_filtered = filter(lambda oi:
                                  int(oi.split(",")[1]) == order_id,
                                  order_items
                                 )
    # Extract the revenue (fifth field) of each remaining item
    order_items_map = map(lambda oi:
                          float(oi.split(",")[4]),
                          order_items_filtered
                         )
    return round(sum(order_items_map), 2)

get_order_revenue(order_items, 2)
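
The same filter, map and aggregate pipeline can also be written with functools.reduce in place of sum. The following is a sketch on made-up rows; the function name order_revenue and the sample values are assumptions for illustration, and the fields beyond index 1 (order id) and index 4 (revenue) are guesses at the layout.

```python
from functools import reduce

# Made-up rows; index 1 is the order id and index 4 is the item revenue
sample_order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]


def order_revenue(order_items, order_id):
    """Sum the revenue of all items that belong to order_id."""
    # filter: keep the items of the requested order
    matching = filter(lambda oi: int(oi.split(",")[1]) == order_id, order_items)
    # map: extract the revenue as a float
    revenues = map(lambda oi: float(oi.split(",")[4]), matching)
    # reduce: add the revenues, starting from 0.0 so an empty result is valid
    return round(reduce(lambda total, r: total + r, revenues, 0.0), 2)


revenue_order_2 = order_revenue(sample_order_items, 2)
revenue_missing = order_revenue(sample_order_items, 99)
```

For the sample rows, revenue_order_2 is 579.98, and an order id with no items yields 0.0 because of the initial value passed to reduce.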