Overview of Collections
Let us quickly recap Collections and Tuples in Python. We will primarily talk about the collections that come as part of the Python standard library, such as list, set, dict and tuple.
Group of elements with length and index - list
Group of unique elements - set
Group of key value pairs - dict
While list and set typically contain a group of homogeneous elements, dict and tuple typically contain a group of heterogeneous elements.
list or set are analogous to a database table, while dict or tuple are analogous to an individual record.
Typically we create a list of tuples or dicts, or a set of tuples (dicts are not hashable, so they cannot be elements of a set). Also, a dict can be considered as a list of pairs. A pair is nothing but a tuple with 2 elements.
list and dict are used far more extensively than set and tuple.
We typically use Map Reduce APIs to process the data in collections. There are also some pre-defined functions such as len, sum, min, max etc. for aggregating data in collections.
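As a quick illustration of the points above, here is a minimal sketch of the four collection types and the built-in aggregate functions. All values and variable names are hypothetical and are used only for illustration.

```python
# Hypothetical sample values, used only to illustrate the collection types.
order_ids = [1, 2, 2, 3]                 # list: ordered, indexed, allows duplicates
unique_order_ids = set(order_ids)        # set: keeps only unique elements
order = {"order_id": 1, "order_status": "CLOSED"}  # dict: key value pairs
order_record = (1, "CLOSED")             # tuple: fixed-size heterogeneous record

print(len(order_ids))         # 4
print(len(unique_order_ids))  # 3
print(order_ids[0])           # 1

# A dict can be viewed as a list of pairs (tuples with 2 elements) and rebuilt from it.
pairs = list(order.items())
print(pairs)                  # [('order_id', 1), ('order_status', 'CLOSED')]
print(dict(pairs) == order)   # True

# Pre-defined functions for aggregating data in a collection.
subtotals = [299.98, 199.99, 250.0]
print(len(subtotals), min(subtotals), max(subtotals))  # 3 199.99 299.98
print(round(sum(subtotals), 2))                         # 749.97
```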
Tasks
Let us perform a few tasks to quickly recap the details of Collections and Tuples in Python. We will also quickly recap Map Reduce APIs.
Create a collection of orders by reading data from a file.
%%sh
ls -ltr /data/retail_db/orders/part-00000
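orders_path = "/data/retail_db/orders/part-00000"
# Read the file into a list of lines; the with block closes the file afterwards
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()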
Get all unique order statuses. Make sure data is sorted in alphabetical order.
sorted(set(map(lambda o: o.split(",")[3], orders)))
Get count of all unique dates.
len(set(map(lambda o: o.split(",")[1], orders)))
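To see how the total count and the unique count differ, here is a small sketch that runs on a few in-memory records instead of the file. The sample records and the name sample_orders are assumptions made for illustration; they only mimic the comma-separated layout used above, with the date in the second field.

```python
# Hypothetical sample records in the same comma-separated layout as orders.
sample_orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]

# Extract the date (second field) from every record.
order_dates = list(map(lambda o: o.split(",")[1], sample_orders))

total_count = len(order_dates)        # counts every record, duplicates included
unique_count = len(set(order_dates))  # set removes the duplicate dates first

print(total_count)   # 3
print(unique_count)  # 2
```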
Sort the data in orders in ascending order by order_customer_id and then order_date.
sorted(orders, key=lambda k: (int(k.split(",")[2]), k.split(",")[1]))
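The composite sort key can be tried on a few in-memory records. This is a minimal sketch; the sample records and the helper name sort_key are hypothetical. It also shows why the customer id is converted to int: compared as strings, "11599" would sort before "256" and "9".

```python
# Hypothetical sample records: order_id, order_date, order_customer_id, order_status.
sample_orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-26 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,256,COMPLETE",
    "4,2013-07-25 00:00:00.0,9,CLOSED",
]

def sort_key(order):
    # Tuples compare element by element: customer id first, then date.
    fields = order.split(",")
    return (int(fields[2]), fields[1])

sorted_orders = sorted(sample_orders, key=sort_key)
sorted_ids = [int(o.split(",")[0]) for o in sorted_orders]
print(sorted_ids)  # [4, 3, 2, 1]

# Without int(), customer ids are compared as text and the order changes.
text_sorted = sorted(sample_orders, key=lambda o: (o.split(",")[2], o.split(",")[1]))
text_sorted_ids = [int(o.split(",")[0]) for o in text_sorted]
print(text_sorted_ids)  # [1, 3, 2, 4]
```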
Create a collection of order_items by reading data from a file.
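order_items_path = "/data/retail_db/order_items/part-00000"
# Read the file into a list of lines; the with block closes the file afterwards
with open(order_items_path) as order_items_file:
    order_items = order_items_file.read().splitlines()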
Get revenue for a given order_item_order_id.
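def get_order_revenue(order_items, order_id):
    # Keep only the items whose order id (second field) matches order_id
    order_items_filtered = filter(
        lambda oi: int(oi.split(",")[1]) == order_id,
        order_items
    )
    # Extract the fifth field (index 4) as a float; it is the amount being summed
    order_items_map = map(
        lambda oi: float(oi.split(",")[4]),
        order_items_filtered
    )
    return round(sum(order_items_map), 2)

get_order_revenue(order_items, 2)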
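The same filter, map and aggregate steps can also be written with functools.reduce in place of sum. The following is only a sketch: the function name get_order_revenue_reduce and the sample records are hypothetical, and the records only mimic the comma-separated layout used above so that the example runs without the data file.

```python
from functools import reduce

# Hypothetical sample records; order id is the second field, amount is the fifth.
sample_order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

def get_order_revenue_reduce(order_items, order_id):
    # filter: keep the items that belong to the requested order
    items_for_order = filter(
        lambda oi: int(oi.split(",")[1]) == order_id,
        order_items
    )
    # map: extract the amount (index 4) as a float
    subtotals = map(lambda oi: float(oi.split(",")[4]), items_for_order)
    # reduce: add the amounts, starting from 0.0 so an empty result is handled
    total = reduce(lambda acc, subtotal: acc + subtotal, subtotals, 0.0)
    return round(total, 2)

print(get_order_revenue_reduce(sample_order_items, 2))   # 579.98
print(get_order_revenue_reduce(sample_order_items, 1))   # 299.98
print(get_order_revenue_reduce(sample_order_items, 99))  # 0.0
```

Passing 0.0 as the initial value to reduce avoids a TypeError when no order items match the given order id.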