Previewing airlines data
Let us preview the airlines data to understand more about it.
As we have too many files, we will just process ten files and preview the data.
File Name: hdfs://public/airlines_all/airlines/part-0000*
spark.read.csv
returns an object of type DataFrame.
Let us start the Spark context for this Notebook so that we can execute the code provided.
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()
If you are using CLIs, you can launch Spark SQL using one of the following three approaches.
Using Spark SQL
spark2-sql \
  --master yarn \
  --conf spark.ui.port=0 \
  --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
  --master yarn \
  --conf spark.ui.port=0 \
  --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
  --master yarn \
  --conf spark.ui.port=0 \
  --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema
airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part*",
        header=True
       )
A DataFrame has a structure, or schema.
We can print the schema using
airlines.printSchema()
We can preview the data using
airlines.show()
By default, it shows 20 records, and some of the column values might be truncated for readability. We can review the details of show by using
help(airlines.show)
We can pass a custom number of records and set
truncate=False
to show the complete values for all the records requested. This lets us preview all columns with the desired number of records.
airlines.show(100, truncate=False)
We can get the number of records or rows in a Data Frame using
airlines.count()
In Databricks Notebook, we can use
display
to preview the data using the Visualization feature. We can perform all kinds of standard transformations on our data. We need good knowledge of functions on DataFrames as well as functions on columns to apply all standard transformations.
Let us also validate whether there are duplicates in our data; if there are, we will remove them while reorganizing the data later.
airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema
%%sh
hdfs dfs -ls /public/airlines_all/airlines/part-0000*
airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part-0000*",
        header=True
       )
airlines.printSchema()
airlines.show()
airlines.show(100, truncate=False)
airlines.count()
airlines.distinct().count()