Creating Temp Views¶
So far we spoke about permanent metastore tables. Now let us understand how to create temporary views using a Data Frame.
We can create temporary view for a Data Frame using
createTempView
orcreateOrReplaceTempView
.createOrReplaceTempView
will replace existing view, if it already exists.While tables in Metastore are permanent, views are temporary.
Once the application exits, temporary views will be deleted or flushed out.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
builder. \
config('spark.ui.port', '0'). \
config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
enableHiveSupport(). \
appName(f'{username} | Python - Spark Metastore'). \
master('yarn'). \
getOrCreate()
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
spark.conf.set('spark.sql.shuffle.partitions', '2')
Tasks¶
Let us perform few tasks to create temporary view and process the data using the temporary view.
Create temporary view by name airport_codes_v for file airport-codes.txt. The file contains header and each field in each row is delimited by a tab character.
import getpass
username = getpass.getuser()
spark.catalog.setCurrentDatabase(f"{username}_airtraffic")
spark.catalog.currentDatabase()
'itversity_airtraffic'
spark.catalog.listTables()
[Table(name='airport_codes', database='itversity_airtraffic', description=None, tableType='MANAGED', isTemporary=False)]
airport_codes_path = f"/public/airtraffic_all/airport-codes"
airport_codes_df = spark. \
read. \
csv(airport_codes_path,
sep="\t",
header=True,
inferSchema=True
)
airport_codes_df.printSchema()
root
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- Country: string (nullable = true)
|-- IATA: string (nullable = true)
airport_codes_df.show()
+-----------+-----+-------+----+
| City|State|Country|IATA|
+-----------+-----+-------+----+
| Abbotsford| BC| Canada| YXX|
| Aberdeen| SD| USA| ABR|
| Abilene| TX| USA| ABI|
| Akron| OH| USA| CAK|
| Alamosa| CO| USA| ALS|
| Albany| GA| USA| ABY|
| Albany| NY| USA| ALB|
|Albuquerque| NM| USA| ABQ|
| Alexandria| LA| USA| AEX|
| Allentown| PA| USA| ABE|
| Alliance| NE| USA| AIA|
| Alpena| MI| USA| APN|
| Altoona| PA| USA| AOO|
| Amarillo| TX| USA| AMA|
|Anahim Lake| BC| Canada| YAA|
| Anchorage| AK| USA| ANC|
| Appleton| WI| USA| ATW|
| Arviat| NWT| Canada| YEK|
| Asheville| NC| USA| AVL|
| Aspen| CO| USA| ASE|
+-----------+-----+-------+----+
only showing top 20 rows
airport_codes_df.createTempView("airport_codes_v")
spark.catalog.listTables()
[Table(name='airport_codes', database='itversity_airtraffic', description=None, tableType='MANAGED', isTemporary=False),
Table(name='airport_codes_v', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
Read data from view and get number of airports by state.
airport_codes = spark.read.table("airport_codes_v")
airport_codes. \
groupBy("state"). \
count(). \
show()
+-----+-----+
|state|count|
+-----+-----+
| BC| 22|
| SD| 7|
| NY| 18|
| NM| 9|
| NE| 9|
| MI| 18|
| NWT| 4|
| NC| 10|
| NJ| 3|
| MD| 3|
| WV| 8|
| MN| 8|
| IL| 12|
| ID| 6|
| IA| 8|
| MO| 8|
| SC| 6|
| VA| 7|
| PEI| 1|
| TN| 6|
+-----+-----+
only showing top 20 rows
spark.sql('''
SELECT state, count(1) AS airport_count
FROM airport_codes_v
GROUP BY state
ORDER BY airport_count DESC
''').show()
+-----+-------------+
|state|airport_count|
+-----+-------------+
| CA| 29|
| TX| 26|
| AK| 25|
| BC| 22|
| NY| 18|
| ON| 18|
| MI| 18|
| FL| 18|
| MT| 14|
| PA| 13|
| PQ| 13|
| IL| 12|
| CO| 12|
| NC| 10|
| WY| 10|
| NE| 9|
| WI| 9|
| WA| 9|
| GA| 9|
| NM| 9|
+-----+-------------+
only showing top 20 rows