Creating Temp Views¶

So far we spoke about permanent metastore tables. Now let us understand how to create temporary views using a Data Frame.

We can create temporary view for a Data Frame using createTempView or createOrReplaceTempView.
createOrReplaceTempView will replace existing view, if it already exists.
While tables in Metastore are permanent, views are temporary.
Once the application exits, temporary views will be deleted or flushed out.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Spark Metastore'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

spark.conf.set('spark.sql.shuffle.partitions', '2')

Tasks¶

Let us perform few tasks to create temporary view and process the data using the temporary view.

Create temporary view by name airport_codes_v for file airport-codes.txt. The file contains header and each field in each row is delimited by a tab character.

import getpass
username = getpass.getuser()

spark.catalog.setCurrentDatabase(f"{username}_airtraffic")

spark.catalog.currentDatabase()

'itversity_airtraffic'

spark.catalog.listTables()

[Table(name='airport_codes', database='itversity_airtraffic', description=None, tableType='MANAGED', isTemporary=False)]

airport_codes_path = f"/public/airtraffic_all/airport-codes"

airport_codes_df = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       )

airport_codes_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- IATA: string (nullable = true)

airport_codes_df.show()

+-----------+-----+-------+----+
|       City|State|Country|IATA|
+-----------+-----+-------+----+
| Abbotsford|   BC| Canada| YXX|
|   Aberdeen|   SD|    USA| ABR|
|    Abilene|   TX|    USA| ABI|
|      Akron|   OH|    USA| CAK|
|    Alamosa|   CO|    USA| ALS|
|     Albany|   GA|    USA| ABY|
|     Albany|   NY|    USA| ALB|
|Albuquerque|   NM|    USA| ABQ|
| Alexandria|   LA|    USA| AEX|
|  Allentown|   PA|    USA| ABE|
|   Alliance|   NE|    USA| AIA|
|     Alpena|   MI|    USA| APN|
|    Altoona|   PA|    USA| AOO|
|   Amarillo|   TX|    USA| AMA|
|Anahim Lake|   BC| Canada| YAA|
|  Anchorage|   AK|    USA| ANC|
|   Appleton|   WI|    USA| ATW|
|     Arviat|  NWT| Canada| YEK|
|  Asheville|   NC|    USA| AVL|
|      Aspen|   CO|    USA| ASE|
+-----------+-----+-------+----+
only showing top 20 rows

airport_codes_df.createTempView("airport_codes_v")

spark.catalog.listTables()

[Table(name='airport_codes', database='itversity_airtraffic', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='airport_codes_v', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

Read data from view and get number of airports by state.

airport_codes = spark.read.table("airport_codes_v")

airport_codes. \
    groupBy("state"). \
    count(). \
    show()

+-----+-----+
|state|count|
+-----+-----+
|   BC|   22|
|   SD|    7|
|   NY|   18|
|   NM|    9|
|   NE|    9|
|   MI|   18|
|  NWT|    4|
|   NC|   10|
|   NJ|    3|
|   MD|    3|
|   WV|    8|
|   MN|    8|
|   IL|   12|
|   ID|    6|
|   IA|    8|
|   MO|    8|
|   SC|    6|
|   VA|    7|
|  PEI|    1|
|   TN|    6|
+-----+-----+
only showing top 20 rows

spark.sql('''
    SELECT state, count(1) AS airport_count
    FROM airport_codes_v 
    GROUP BY state
    ORDER BY airport_count DESC
''').show()

+-----+-------------+
|state|airport_count|
+-----+-------------+
|   CA|           29|
|   TX|           26|
|   AK|           25|
|   BC|           22|
|   NY|           18|
|   ON|           18|
|   MI|           18|
|   FL|           18|
|   MT|           14|
|   PA|           13|
|   PQ|           13|
|   IL|           12|
|   CO|           12|
|   NC|           10|
|   WY|           10|
|   NE|            9|
|   WI|            9|
|   WA|            9|
|   GA|            9|
|   NM|            9|
+-----+-------------+
only showing top 20 rows

Mastering Pyspark

Creating Temp Views¶

Tasks¶