Analyze and Understand Data

Let us analyze and understand the data in more detail, using the data for January 2008.

  • First, let us read the data for January 2008.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our 10-node, state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL through one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

airlines = spark. \
    read. \
    parquet(airlines_path)
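The path above ends in flightmonth=200801, which suggests one directory per month. A minimal sketch, assuming other months follow the same flightmonth=YYYYMM naming (this is not confirmed here); airlines_month_path and AIRLINES_BASE_DIR are hypothetical names introduced only for illustration:

```python
# Assumption: every month is stored under <base_dir>/flightmonth=YYYYMM,
# as the 200801 path used in this section suggests.
AIRLINES_BASE_DIR = "/public/airlines_all/airlines-part"


def airlines_month_path(year: int, month: int, base_dir: str = AIRLINES_BASE_DIR) -> str:
    """Build the directory path for one month of airlines data."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    # Zero-pad the month so that January 2008 becomes 200801.
    return f"{base_dir}/flightmonth={year:04d}{month:02d}"


# Reproduces the path used in this section for January 2008.
january_2008_path = airlines_month_path(2008, 1)
print(january_2008_path)
```

The returned string can be passed to spark.read.parquet in place of the hard-coded airlines_path.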
airlines.count()

airlines.printSchema()

  • Get number of records - airlines.count()

  • Go through the list of columns and understand the purpose of each.

    • Year

    • Month

    • DayOfMonth

    • CRSDepTime - Scheduled Departure Time

    • DepTime - Actual Departure Time

    • DepDelay - Departure Delay in Minutes

    • CRSArrTime - Scheduled Arrival Time

    • ArrTime - Actual Arrival Time

    • ArrDelay - Arrival Delay in Minutes

    • UniqueCarrier - Carrier or Airline

    • FlightNum - Flight Number

    • Distance - Distance between Origin and Destination

    • IsDepDelayed - Set to yes for flights where departure is delayed

    • IsArrDelayed - Set to yes for flights where arrival is delayed
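To see how the scheduled time, actual time, delay, and delayed-flag columns relate to each other, here is a rough sketch in plain Python. It rests on two assumptions that should be checked against the actual data: that the time columns hold local clock times encoded as HHMM numbers (for example 1845 for 6:45 PM), and that a flight counts as delayed when its delay is greater than zero minutes. The function names are hypothetical, and the DepDelay and ArrDelay columns in the data remain the authoritative values.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert a clock time encoded as HHMM (assumed encoding) to minutes since midnight."""
    hours, minutes = divmod(hhmm, 100)
    if not (0 <= hours <= 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid HHMM time: {hhmm}")
    return hours * 60 + minutes


def delay_in_minutes(scheduled_hhmm: int, actual_hhmm: int) -> int:
    """Difference between actual and scheduled time, in minutes.

    This sketch ignores flights that cross midnight, so it can be wrong
    for late-night departures or arrivals.
    """
    return hhmm_to_minutes(actual_hhmm) - hhmm_to_minutes(scheduled_hhmm)


def delayed_flag(delay_minutes: int) -> str:
    """Assumed rule: a flight is flagged as delayed when the delay is positive."""
    return "YES" if delay_minutes > 0 else "NO"


# Illustrative values only (not taken from the data set):
# scheduled departure 18:30, actual departure 18:45.
dep_delay = delay_in_minutes(1830, 1845)
print(dep_delay, delayed_flag(dep_delay))
```

Comparing values computed this way with DepDelay and IsDepDelayed on a few rows is a quick way to confirm or reject both assumptions.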

  • Get number of unique origins

airlines. \
    select("Origin"). \
    distinct(). \
    count()

  • Get number of unique destinations

airlines. \
    select("Dest"). \
    distinct(). \
    count()

  • Get all unique carriers

airlines. \
    select('UniqueCarrier'). \
    distinct(). \
    show()
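The three queries above share the same select, distinct pattern. As a small sketch, that pattern can be wrapped in a helper; count_unique and count_unique_many are hypothetical names, and the code relies only on the select, distinct, and count DataFrame methods already used in this section:

```python
def count_unique(df, column_name):
    """Return the number of distinct values in one column of a DataFrame."""
    return df.select(column_name).distinct().count()


def count_unique_many(df, column_names):
    """Return a dict mapping each column name to its distinct-value count.

    Each column is counted with its own count() call, so Spark runs
    one job per column.
    """
    return {name: count_unique(df, name) for name in column_names}


# Usage with the DataFrame read earlier in this section:
# count_unique_many(airlines, ["Origin", "Dest", "UniqueCarrier"])
```

For a handful of columns this is convenient; running one job per column may be slow on large data, in which case a single aggregation over all columns is worth considering.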