Prerequisites and Objectives

Let us review the prerequisites before getting into the topics covered in this section.

  • Good understanding of Data Processing using Python.

  • Data Processing Life Cycle

    • Reading Data from files

    • Processing Data using APIs

    • Writing Processed Data back to files

  • We can also use Databases as sources and sinks. This will be covered later.

  • We can also read data in a streaming fashion, which is outside the scope of this course.
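As a reminder of what the read, process, and write steps look like in plain Python, here is a minimal sketch using only the standard library. The file names, the `dep_delay` column, and the sample rows are made up for illustration and are not part of the course data.

```python
import csv
import tempfile
from pathlib import Path


def process_file(source, target):
    """Read rows from a CSV file, keep delayed flights, and write them back."""
    # Read: load every row as a dict keyed by the header names.
    with open(source, newline="") as src:
        rows = list(csv.DictReader(src))

    # Process: keep only the rows with a positive departure delay.
    delayed = [row for row in rows if int(row["dep_delay"]) > 0]

    # Write: save the processed rows to a new file with the same header.
    with open(target, "w", newline="") as tgt:
        writer = csv.DictWriter(tgt, fieldnames=["flight", "dep_delay"])
        writer.writeheader()
        writer.writerows(delayed)
    return len(delayed)


# Hypothetical sample data written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
source = workdir / "flights.csv"
source.write_text("flight,dep_delay\nAA1,5\nAA2,-3\nAA3,12\n")

delayed_count = process_file(source, workdir / "delayed.csv")
print(delayed_count)
```

The same three stages carry over to PySpark, where Data Frame APIs replace the list comprehension and the file readers and writers handle distributed storage.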

We will get an overview of the Data Processing Life Cycle using PySpark by the end of the section or module.

  • Read airlines data from the file.

  • Preview the schema and data to understand the characteristics of the data.

  • Get an overview of Data Frame APIs as well as functions used to process the data.

  • Check if there are any duplicates in the data.

  • Get an overview of how to write data from Data Frames to files using file formats such as Parquet, with compression.

  • Reorganize the data by month using a different file format and a partitioning strategy.

  • We will deep dive into Data Frame APIs to process the data in subsequent sections or modules.
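The objectives above can be sketched roughly as a single function that works on an existing `SparkSession`. This is only an outline under stated assumptions: the input is assumed to be CSV with a header row, the partition column is assumed to be named `Month`, JSON is picked arbitrarily as the "different file format", and the function name and paths are hypothetical.

```python
def airlines_lifecycle(spark, input_path, output_dir):
    """Sketch of the read -> preview -> check -> write flow with PySpark.

    `spark` is an existing SparkSession; the CSV input format and the
    `Month` column are assumptions made for illustration.
    """
    # Read airlines data from files (assumed CSV with a header row).
    airlines = spark.read.csv(input_path, header=True, inferSchema=True)

    # Preview the schema and a few rows to understand the data.
    airlines.printSchema()
    airlines.show(5)

    # Check for duplicates by comparing total and distinct row counts.
    total = airlines.count()
    distinct = airlines.distinct().count()
    duplicate_count = total - distinct

    # Write the Data Frame to Parquet files using snappy compression.
    airlines.write.mode("overwrite").parquet(
        f"{output_dir}/parquet", compression="snappy"
    )

    # Reorganize by month: a different file format plus partitioning.
    airlines.write.mode("overwrite").partitionBy("Month").json(
        f"{output_dir}/by_month"
    )
    return duplicate_count


# Usage, assuming PySpark is installed and the paths exist:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("airlines").getOrCreate()
#   airlines_lifecycle(spark, "<input path>", "<output dir>")
```

Comparing `count()` with `distinct().count()` only reports whether fully identical rows exist; it is one simple way to check for duplicates, not necessarily the approach used later in the course.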