Previewing airlines data

Let us preview the airlines data to understand more about it.

  • As there are too many files, we will process only ten of them and preview the data.

  • File Name: /public/airlines_all/airlines/part-0000* (on HDFS)

  • spark.read.csv returns an object of type Data Frame.

Let us start the Spark context for this Notebook so that we can execute the code provided.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can access Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema
airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part*",
        header=True
       )

A Data Frame has a structure, or schema.

  • We can print the schema using airlines.printSchema().

  • We can preview the data using airlines.show(). By default it shows 20 records, and some column values might be truncated for readability purposes.

  • We can review the details of show using help(airlines.show).

  • We can pass a custom number of records and set truncate=False to show the complete values of all the records requested. This lets us preview all columns with the desired number of records.

airlines.show(100, truncate=False)
  • We can get the number of records or rows in a Data Frame using airlines.count().

  • In a Databricks Notebook, we can use display to preview the data using the Visualization feature.

  • We can perform all kinds of standard transformations on our data. We need good knowledge of the functions on Data Frames as well as the functions on columns to apply all standard transformations.

  • Let us also validate whether there are duplicates in our data; if there are, we will remove them while reorganizing the data later.

airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema
%%sh

hdfs dfs -ls /public/airlines_all/airlines/part-0000*
airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part-0000*",
        header=True
       )
airlines.printSchema()
airlines.show()
airlines.show(100, truncate=False)
airlines.count()
airlines.distinct().count()