{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Partitioned Tables\n", "\n", "We can also create partitioned tables as part of Spark Metastore Tables.\n", "\n", "* There are some challenges in creating partitioned tables directly using `spark.catalog.createTable`.\n", "* But if the directories are similar to partitioned tables with data, we should be able to create partitioned tables.\n", "* Let us create partitioned table for `orders` by `order_month`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "import getpass\n", "username = getpass.getuser()\n", "\n", "spark = SparkSession. \\\n", " builder. \\\n", " config('spark.ui.port', '0'). \\\n", " config(\"spark.sql.warehouse.dir\", f\"/user/{username}/warehouse\"). \\\n", " enableHiveSupport(). \\\n", " appName(f'{username} | Python - Spark Metastore'). \\\n", " master('yarn'). \\\n", " getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.\n", "\n", "**Using Spark SQL**\n", "\n", "```\n", "spark2-sql \\\n", " --master yarn \\\n", " --conf spark.ui.port=0 \\\n", " --conf spark.sql.warehouse.dir=/user/${USER}/warehouse\n", "```\n", "\n", "**Using Scala**\n", "\n", "```\n", "spark2-shell \\\n", " --master yarn \\\n", " --conf spark.ui.port=0 \\\n", " --conf spark.sql.warehouse.dir=/user/${USER}/warehouse\n", "```\n", "\n", "**Using Pyspark**\n", "\n", "```\n", "pyspark2 \\\n", " --master yarn \\\n", " --conf spark.ui.port=0 \\\n", " --conf spark.sql.warehouse.dir=/user/${USER}/warehouse\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "spark.conf.set('spark.sql.shuffle.partitions', '2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tasks\n", "\n", "Let us perform tasks related to partitioned tables.\n", "* Read data from file into data frame.\n", "* Add additional column which will be used to partition the data.\n", "* Write the data into the target location on which we are going to create the table.\n", "* Create partitioned table using the location to which we have copied the data and validate.\n", "* We can recover partitions by running `MSCK REPAIR TABLE` using `spark.sql` or by invoking `spark.catalog.recoverPartitions`.\n", "* When we use `createTable` to create partitioned table, we have to recover partitions so that partitions are visible." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "username = getpass.getuser()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n" ], "text/plain": [ "++\n", "||\n", "++\n", "++" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sql(f'CREATE DATABASE IF NOT EXISTS {username}_retail')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "spark.catalog.setCurrentDatabase(f'{username}_retail')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'itversity_retail'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.catalog.currentDatabase()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "orders_path = '/public/retail_db/orders'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 1 items\n", "-rw-r--r-- 2 hdfs supergroup 2999944 2021-01-28 09:27 /public/retail_db/orders/part-00000\n" ] } ], "source": [ "%%sh\n", "\n", "hdfs dfs -ls /public/retail_db/orders" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n" ], "text/plain": [ "++\n", "||\n", "++\n", "++" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sql('DROP TABLE IF EXISTS orders_part')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 14 items\n", "-rw-r--r-- 3 itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/_SUCCESS\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201307\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201308\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201309\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201310\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201311\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201312\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201401\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201402\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201403\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201404\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201405\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201406\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:25 /user/itversity/retail_db/orders_part/order_month=201407\n" ] } ], "source": [ "%%sh\n", "\n", "hdfs dfs -ls /user/`whoami`/retail_db/orders_part" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2021-03-13 15:31:04,403 INFO fs.TrashPolicyDefault: Moved: 'hdfs://m01.itversity.com:9000/user/itversity/retail_db/orders_part' to trash at: hdfs://m01.itversity.com:9000/user/itversity/.Trash/Current/user/itversity/retail_db/orders_part1615667464378\n" ] } ], "source": [ "%%sh\n", "\n", "hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db/orders_part" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql.functions import date_format" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "spark. \\\n", " read. \\\n", " csv(orders_path,\n", " schema='''order_id INT, order_date DATE,\n", " order_customer_id INT, order_status STRING\n", " '''\n", " ). \\\n", " withColumn('order_month', date_format('order_date', 'yyyyMM')). \\\n", " write. \\\n", " partitionBy('order_month'). \\\n", " parquet(f'/user/{username}/retail_db/orders_part')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 14 items\n", "-rw-r--r-- 3 itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/_SUCCESS\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201307\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201308\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201309\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201310\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201311\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201312\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201401\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201402\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201403\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201404\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201405\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201406\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201407\n" ] } ], "source": [ "%%sh\n", "\n", "hdfs dfs -ls /user/`whoami`/retail_db/orders_part" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 3 itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/_SUCCESS\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201307\n", "-rw-r--r-- 3 itversity itversity 14435 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201307/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201308\n", "-rw-r--r-- 3 itversity itversity 49997 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201308/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201309\n", "-rw-r--r-- 3 itversity itversity 51358 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201309/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201310\n", "-rw-r--r-- 3 itversity itversity 47051 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201310/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201311\n", "-rw-r--r-- 3 itversity itversity 55949 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201311/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201312\n", "-rw-r--r-- 3 itversity itversity 51794 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201312/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201401\n", "-rw-r--r-- 3 itversity itversity 51938 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201401/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201402\n", "-rw-r--r-- 3 itversity itversity 49591 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201402/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201403\n", "-rw-r--r-- 3 itversity itversity 50816 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201403/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201404\n", "-rw-r--r-- 3 itversity itversity 49799 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201404/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201405\n", "-rw-r--r-- 3 itversity itversity 48183 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201405/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201406\n", "-rw-r--r-- 3 itversity itversity 46828 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201406/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n", "drwxr-xr-x - itversity itversity 0 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201407\n", "-rw-r--r-- 3 itversity itversity 39605 2021-03-13 15:31 /user/itversity/retail_db/orders_part/order_month=201407/part-00000-ccc05948-3707-47fe-9855-f041cd350b7e.c000.snappy.parquet\n" ] } ], "source": [ "%%sh\n", "\n", "hdfs dfs -ls -R /user/`whoami`/retail_db/orders_part" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+----------+-----------------+---------------+\n", "|order_id|order_date|order_customer_id| order_status|\n", "+--------+----------+-----------------+---------------+\n", "| 1297|2013-08-01| 11607| COMPLETE|\n", "| 1298|2013-08-01| 5105| CLOSED|\n", "| 1299|2013-08-01| 7802| COMPLETE|\n", "| 1300|2013-08-01| 553|PENDING_PAYMENT|\n", "| 1301|2013-08-01| 1604|PENDING_PAYMENT|\n", "| 1302|2013-08-01| 1695| COMPLETE|\n", "| 1303|2013-08-01| 7018| PROCESSING|\n", "| 1304|2013-08-01| 2059| COMPLETE|\n", "| 1305|2013-08-01| 3844| COMPLETE|\n", "| 1306|2013-08-01| 11672|PENDING_PAYMENT|\n", "| 1307|2013-08-01| 4474| COMPLETE|\n", "| 1308|2013-08-01| 11645| PENDING|\n", "| 1309|2013-08-01| 2367| CLOSED|\n", "| 1310|2013-08-01| 5602| PENDING|\n", "| 1311|2013-08-01| 5396|PENDING_PAYMENT|\n", "| 1312|2013-08-01| 12291| COMPLETE|\n", "| 1313|2013-08-01| 3471| CANCELED|\n", "| 1314|2013-08-01| 10993| COMPLETE|\n", "| 1315|2013-08-01| 5660| COMPLETE|\n", "| 1316|2013-08-01| 6376|PENDING_PAYMENT|\n", "+--------+----------+-----------------+---------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "spark. \\\n", " read. \\\n", " parquet(f'/user/{username}/retail_db/orders_part/order_month=201308'). \\\n", " show()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+----------+-----------------+---------------+-----------+\n", "|order_id|order_date|order_customer_id| order_status|order_month|\n", "+--------+----------+-----------------+---------------+-----------+\n", "| 15488|2013-11-01| 8987|PENDING_PAYMENT| 201311|\n", "| 15489|2013-11-01| 5359|PENDING_PAYMENT| 201311|\n", "| 15490|2013-11-01| 10149| COMPLETE| 201311|\n", "| 15491|2013-11-01| 10635| ON_HOLD| 201311|\n", "| 15492|2013-11-01| 7784|PENDING_PAYMENT| 201311|\n", "| 15493|2013-11-01| 1104| ON_HOLD| 201311|\n", "| 15494|2013-11-01| 7313| PROCESSING| 201311|\n", "| 15495|2013-11-01| 7067| CLOSED| 201311|\n", "| 15496|2013-11-01| 12153|PENDING_PAYMENT| 201311|\n", "| 15497|2013-11-01| 11115|PENDING_PAYMENT| 201311|\n", "| 15498|2013-11-01| 11195| COMPLETE| 201311|\n", "| 15499|2013-11-01| 7113| CLOSED| 201311|\n", "| 15500|2013-11-01| 6780|PENDING_PAYMENT| 201311|\n", "| 15501|2013-11-01| 9703| ON_HOLD| 201311|\n", "| 15502|2013-11-01| 10009| COMPLETE| 201311|\n", "| 15503|2013-11-01| 6521|PENDING_PAYMENT| 201311|\n", "| 15504|2013-11-01| 10601|PENDING_PAYMENT| 201311|\n", "| 15505|2013-11-01| 1068|PENDING_PAYMENT| 201311|\n", "| 15506|2013-11-01| 2742|PENDING_PAYMENT| 201311|\n", "| 15507|2013-11-01| 3503| COMPLETE| 201311|\n", "+--------+----------+-----------------+---------------+-----------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "spark. \\\n", " read. \\\n", " parquet(f'/user/{username}/retail_db/orders_part'). \\\n", " show()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
order_idorder_dateorder_customer_idorder_statusorder_month
\n" ], "text/plain": [ "+--------+----------+-----------------+------------+-----------+\n", "|order_id|order_date|order_customer_id|order_status|order_month|\n", "+--------+----------+-----------------+------------+-----------+\n", "+--------+----------+-----------------+------------+-----------+" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark. \\\n", " catalog. \\\n", " createTable('orders_part',\n", " path=f'/user/{username}/retail_db/orders_part',\n", " source='parquet'\n", " )" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+----------+-----------------+------------+-----------+\n", "|order_id|order_date|order_customer_id|order_status|order_month|\n", "+--------+----------+-----------------+------------+-----------+\n", "+--------+----------+-----------------+------------+-----------+\n", "\n" ] } ], "source": [ "spark.read.table('orders_part').show()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|partition|\n", "+---------+\n", "+---------+\n", "\n" ] } ], "source": [ "spark.sql('SHOW PARTITIONS orders_part').show()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "spark.catalog.recoverPartitions('orders_part')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+\n", "| partition|\n", "+------------------+\n", "|order_month=201307|\n", "|order_month=201308|\n", "|order_month=201309|\n", "|order_month=201310|\n", "|order_month=201311|\n", "|order_month=201312|\n", "|order_month=201401|\n", "|order_month=201402|\n", "|order_month=201403|\n", "|order_month=201404|\n", "|order_month=201405|\n", "|order_month=201406|\n", "|order_month=201407|\n", "+------------------+\n", "\n" ] } ], "source": [ "spark.sql('SHOW PARTITIONS orders_part').show()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+----------+-----------------+---------------+-----------+\n", "|order_id|order_date|order_customer_id| order_status|order_month|\n", "+--------+----------+-----------------+---------------+-----------+\n", "| 15488|2013-11-01| 8987|PENDING_PAYMENT| 201311|\n", "| 15489|2013-11-01| 5359|PENDING_PAYMENT| 201311|\n", "| 15490|2013-11-01| 10149| COMPLETE| 201311|\n", "| 15491|2013-11-01| 10635| ON_HOLD| 201311|\n", "| 15492|2013-11-01| 7784|PENDING_PAYMENT| 201311|\n", "| 15493|2013-11-01| 1104| ON_HOLD| 201311|\n", "| 15494|2013-11-01| 7313| PROCESSING| 201311|\n", "| 15495|2013-11-01| 7067| CLOSED| 201311|\n", "| 15496|2013-11-01| 12153|PENDING_PAYMENT| 201311|\n", "| 15497|2013-11-01| 11115|PENDING_PAYMENT| 201311|\n", "| 15498|2013-11-01| 11195| COMPLETE| 201311|\n", "| 15499|2013-11-01| 7113| CLOSED| 201311|\n", "| 15500|2013-11-01| 6780|PENDING_PAYMENT| 201311|\n", "| 15501|2013-11-01| 9703| ON_HOLD| 201311|\n", "| 15502|2013-11-01| 10009| COMPLETE| 201311|\n", "| 15503|2013-11-01| 6521|PENDING_PAYMENT| 201311|\n", "| 15504|2013-11-01| 10601|PENDING_PAYMENT| 201311|\n", "| 15505|2013-11-01| 1068|PENDING_PAYMENT| 201311|\n", "| 15506|2013-11-01| 2742|PENDING_PAYMENT| 201311|\n", "| 15507|2013-11-01| 3503| COMPLETE| 201311|\n", "+--------+----------+-----------------+---------------+-----------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "spark.read.table('orders_part').show()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------+--------+\n", "|order_month|count(1)|\n", "+-----------+--------+\n", "| 201405| 5467|\n", "| 201311| 6381|\n", "| 201401| 5908|\n", "| 201309| 5841|\n", "| 201308| 5680|\n", "| 201404| 5657|\n", "| 201402| 5635|\n", "| 201310| 5335|\n", "| 201406| 5308|\n", "| 201407| 4468|\n", "| 201307| 1533|\n", "| 201312| 5892|\n", "| 201403| 5778|\n", "+-----------+--------+\n", "\n" ] } ], "source": [ "spark.sql('SELECT order_month, count(1) FROM orders_part GROUP BY order_month').show()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------+-----+\n", "|order_month|count|\n", "+-----------+-----+\n", "| 201311| 6381|\n", "| 201401| 5908|\n", "| 201309| 5841|\n", "| 201308| 5680|\n", "| 201404| 5657|\n", "| 201405| 5467|\n", "| 201312| 5892|\n", "| 201403| 5778|\n", "| 201402| 5635|\n", "| 201310| 5335|\n", "| 201406| 5308|\n", "| 201407| 4468|\n", "| 201307| 1533|\n", "+-----------+-----+\n", "\n" ] } ], "source": [ "spark.read.table('orders_part'). \\\n", " groupBy('order_month'). \\\n", " count(). \\\n", " show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Pyspark 2", "language": "python", "name": "pyspark2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" } }, "nbformat": 4, "nbformat_minor": 4 }