Overview of Collections
Let's quickly recap collections and tuples in Python. We will primarily talk about the collections that come as part of the Python standard library, such as list, set, dict, and tuple.
Group of elements with length and index - list
Group of unique elements - set
Group of key-value pairs - dict
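As a minimal sketch of the three kinds of collections listed above, the snippet below builds one of each. The sample values are made up purely for illustration.

```python
# Hypothetical sample values, chosen only for illustration
order_ids = [3, 1, 2, 1]      # list: ordered, indexable, allows duplicates
unique_ids = set(order_ids)   # set: duplicates removed, no positional index
order = {"order_id": 1, "order_status": "CLOSED"}  # dict: key-value pairs

print(len(order_ids), order_ids[0])  # length and positional access on a list
print(sorted(unique_ids))            # a set has no order, so sort it for display
print(order["order_status"])         # look up a value by its key
```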
While list and set typically contain a group of homogeneous elements, dict and tuple typically contain a group of heterogeneous elements.
A list or set is analogous to a database table, while a dict or tuple is analogous to an individual record.
Typically we create a list of tuples or dicts, or a set of tuples (dicts are not hashable, so they cannot be elements of a set). Also, a dict can be considered a list of pairs. A pair is nothing but a tuple with 2 elements.
list and dict are used far more extensively than set and tuple.
We typically use Map Reduce APIs such as map, filter, and functools.reduce to process the data in collections. There are also some pre-defined functions such as len, sum, min, and max for aggregating data in collections.
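To make this concrete, here is a small sketch that uses the built-in map and filter, functools.reduce, and the aggregate functions on a hypothetical list of amounts. The numbers are invented for illustration.

```python
from functools import reduce

# Hypothetical order item subtotals, for illustration only
subtotals = [299.98, 199.99, 250.0, 129.99]

# filter keeps the matching elements; map transforms each element
large = list(filter(lambda s: s >= 200, subtotals))
rounded = list(map(lambda s: round(s), subtotals))

# reduce folds the collection into a single value; here it mirrors sum
total = reduce(lambda acc, s: acc + s, subtotals, 0.0)

print(len(subtotals), min(subtotals), max(subtotals))
print(round(sum(subtotals), 2), round(total, 2))
```

Note that map and filter return lazy iterators in Python 3, which is why the results are wrapped in list before being reused.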
Tasks
Let us perform a few tasks to quickly recap collections and tuples in Python. We will also quickly recap the Map Reduce APIs.
Create a collection of orders by reading data from a file.
%%sh
ls -ltr /data/retail_db/orders/part-00000
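orders_path = "/data/retail_db/orders/part-00000"
# Read the whole file and split it into one string per record;
# the with block closes the file once reading is done
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()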
Get all unique order statuses. Make sure data is sorted in alphabetical order.
sorted(set(map(lambda o: o.split(",")[3], orders)))
Get count of all unique dates.
len(set(map(lambda o: o.split(",")[1], orders)))
Sort the data in orders in ascending order by order_customer_id and then order_date.
sorted(orders, key=lambda k: (int(k.split(",")[2]), k.split(",")[1]))
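If the file is not available, the same three tasks can be tried on a few hand-written lines. The sample below is hypothetical; it assumes the comma-separated layout implied by the tasks, with the date at index 1, the customer id at index 2, and the status at index 3.

```python
# Hypothetical sample, assumed layout:
# order_id,order_date,order_customer_id,order_status
orders_sample = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-26 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,256,COMPLETE",
    "4,2013-07-25 00:00:00.0,8827,CLOSED",
]

# Unique statuses, sorted alphabetically
statuses = sorted(set(o.split(",")[3] for o in orders_sample))

# Count of unique dates: deduplicate with a set before counting
unique_date_count = len(set(o.split(",")[1] for o in orders_sample))

# Tuple keys compare element by element: customer id first, then date
sorted_orders = sorted(
    orders_sample,
    key=lambda o: (int(o.split(",")[2]), o.split(",")[1]),
)

print(statuses)
print(unique_date_count)
for o in sorted_orders:
    print(o)
```

Converting the customer id with int matters here: compared as strings, "11599" would sort before "256".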
Create a collection of order_items by reading data from a file.
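order_items_path = "/data/retail_db/order_items/part-00000"
# Read the whole file and split it into one string per record;
# the with block closes the file once reading is done
with open(order_items_path) as order_items_file:
    order_items = order_items_file.read().splitlines()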
Get revenue for a given order_item_order_id.
def get_order_revenue(order_items, order_id):
    # Keep only the items that belong to the requested order
    order_items_filtered = filter(
        lambda oi: int(oi.split(",")[1]) == order_id,
        order_items
    )
    # Extract the revenue field (index 4) of each remaining item
    order_items_map = map(
        lambda oi: float(oi.split(",")[4]),
        order_items_filtered
    )
    return round(sum(order_items_map), 2)

get_order_revenue(order_items, 2)
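A quick way to check a revenue function like this is to run it on a few hand-written lines and call it with more than one order id. The sketch below is an alternative formulation using a generator expression, not the approach used in the task itself. The function name get_order_revenue_sketch and the sample lines are hypothetical; only index 1 (order id) and index 4 (revenue) are taken from the task.

```python
# Hypothetical sample lines; only fields 1 (order id) and 4 (revenue) are used
order_items_sample = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
    "5,4,897,2,49.98,24.99",
]

def get_order_revenue_sketch(order_items, order_id):
    # Generator expression: filter and transform in a single pass
    revenues = (
        float(oi.split(",")[4])
        for oi in order_items
        if int(oi.split(",")[1]) == order_id
    )
    return round(sum(revenues), 2)

print(get_order_revenue_sketch(order_items_sample, 2))
print(get_order_revenue_sketch(order_items_sample, 4))
```

Calling the function with different order ids confirms that the order_id argument, and not a fixed value, drives the filter.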