Overview of Basic Transformations

Let us define a few problem statements to learn more about Data Frame APIs. The solutions to these problem statements cover filtering, aggregations and sorting.

  • Get the total number of flights, the number of flights delayed in departure and the number of flights delayed in arrival.

    • Output should contain 3 columns - FlightCount, DepDelayedCount, ArrDelayedCount

  • For each day, get the number of flights that departed, the number of flights delayed in departure and the number of flights delayed in arrival.

    • Output should contain 4 columns - FlightDate, FlightCount, DepDelayedCount, ArrDelayedCount

    • FlightDate should be in yyyy-MM-dd format.

    • Data should be sorted in ascending order by FlightDate.
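To pin down what the two aggregation problems above are expected to return, here is a minimal sketch in plain Python rather than the Data Frame APIs. The sample rows are made up, and the IsDepDelayed field with "YES"/"NO" values is an assumption modelled on the IsArrDelayed flag mentioned in these exercises.

```python
from collections import defaultdict

# Hypothetical sample rows; IsDepDelayed and the "YES"/"NO" values are assumptions.
flights = [
    {"Year": 2008, "Month": 1, "DayOfMonth": 2, "IsDepDelayed": "YES", "IsArrDelayed": "NO"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 1, "IsDepDelayed": "YES", "IsArrDelayed": "YES"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 1, "IsDepDelayed": "NO", "IsArrDelayed": "NO"},
]


def daily_delay_counts(rows):
    """Return (FlightDate, FlightCount, DepDelayedCount, ArrDelayedCount) per day."""
    counts = defaultdict(lambda: [0, 0, 0])
    for r in rows:
        # Build FlightDate in yyyy-MM-dd form from the three date fields.
        flight_date = f"{r['Year']:04d}-{r['Month']:02d}-{r['DayOfMonth']:02d}"
        c = counts[flight_date]
        c[0] += 1                            # FlightCount
        c[1] += r["IsDepDelayed"] == "YES"   # DepDelayedCount
        c[2] += r["IsArrDelayed"] == "YES"   # ArrDelayedCount
    # Zero-padded yyyy-MM-dd strings sort chronologically, so a plain sort is enough.
    return [(d, *c) for d, c in sorted(counts.items())]


daily = daily_delay_counts(flights)

# The first problem is the same aggregation without the grouping key.
totals = tuple(sum(col) for col in zip(*[row[1:] for row in daily]))
```

In the Data Frame version, the same shape would typically come from a grouped aggregation with conditional counts followed by a sort on FlightDate.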

  • Get all the flights which departed late but arrived early (IsArrDelayed is NO).

    • Output should contain - FlightCRSDepTime, UniqueCarrier, FlightNum, Origin, Dest, DepDelay, ArrDelay

    • FlightCRSDepTime needs to be computed using Year, Month, DayOfMonth, CRSDepTime

    • FlightCRSDepTime should be displayed using yyyy-MM-dd HH:mm format.

    • Output should be sorted by FlightCRSDepTime and then by the difference between DepDelay and ArrDelay

    • Also get the count of such flights
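The last problem combines a derived timestamp, a filter and a two-level sort. The following plain Python sketch (not the Data Frame APIs) illustrates the intended logic under some assumptions: the sample rows are made up, CRSDepTime is taken to be an integer in HHmm form without zero padding (for example 705 for 07:05), and "departed late" is read as DepDelay > 0.

```python
from datetime import datetime

# Hypothetical sample rows using the column names from the problem statement.
flights = [
    {"Year": 2008, "Month": 1, "DayOfMonth": 3, "CRSDepTime": 1955,
     "UniqueCarrier": "WN", "FlightNum": 335, "Origin": "IAD", "Dest": "TPA",
     "DepDelay": 8, "ArrDelay": -14, "IsArrDelayed": "NO"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 3, "CRSDepTime": 705,
     "UniqueCarrier": "WN", "FlightNum": 448, "Origin": "IND", "Dest": "BWI",
     "DepDelay": 5, "ArrDelay": -2, "IsArrDelayed": "NO"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 3, "CRSDepTime": 1755,
     "UniqueCarrier": "WN", "FlightNum": 3920, "Origin": "IND", "Dest": "BWI",
     "DepDelay": 34, "ArrDelay": 34, "IsArrDelayed": "YES"},
]


def flight_crs_dep_time(r):
    """Combine Year, Month, DayOfMonth and CRSDepTime (assumed HHmm) into a datetime."""
    hour, minute = divmod(r["CRSDepTime"], 100)
    return datetime(r["Year"], r["Month"], r["DayOfMonth"], hour, minute)


# Keep flights that departed late (assumed: DepDelay > 0) but did not arrive late.
late_but_early = [r for r in flights if r["DepDelay"] > 0 and r["IsArrDelayed"] == "NO"]

# Sort by scheduled departure, then by the difference between DepDelay and ArrDelay.
late_but_early.sort(key=lambda r: (flight_crs_dep_time(r), r["DepDelay"] - r["ArrDelay"]))

# Project the requested columns; "%Y-%m-%d %H:%M" corresponds to yyyy-MM-dd HH:mm.
report = [
    (flight_crs_dep_time(r).strftime("%Y-%m-%d %H:%M"), r["UniqueCarrier"],
     r["FlightNum"], r["Origin"], r["Dest"], r["DepDelay"], r["ArrDelay"])
    for r in late_but_early
]
count = len(report)
```

Sorting on the timestamp value rather than on the formatted string keeps the ordering correct regardless of the display format chosen for FlightCRSDepTime.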