Overview of Spark read APIs

Let us get an overview of the Spark read APIs used to read files of different formats.

  • Spark provides several APIs to read data from files of different formats.

  • All of these APIs are exposed under spark.read.

    • text - to read single-column data from text files, or to read each entire text file as a single record.

    • csv - to read delimited text files. The default delimiter is a comma, but we can use other delimiters as well (see the sketch after the CSV demo below).

    • json - to read data from JSON files.

    • orc - to read data from ORC files.

    • parquet - to read data from Parquet files.

    • We can also read data from other file formats by plugging in external data source libraries and using spark.read.format (see the sketch after this list).

  • We can also pass options depending on the file format.

    • inferSchema - to infer the data types of the columns based on the data.

    • header - to use the first line of delimited text files as column names.

    • schema - to explicitly specify the schema.

  • We can get help on APIs such as spark.read.csv using help(spark.read.csv).

  • Reading delimited data from text files.
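As mentioned above, spark.read.format along with option can be used to read other file formats and to pass format-specific options. Here is a minimal sketch, assuming a hypothetical CSV dataset with a header line at /public/hypothetical/data (run it after creating the spark session below); it is equivalent to calling spark.read.csv with the same options.

# Read a CSV dataset using the generic format/option/load API.
# The path is hypothetical; header and inferSchema are standard CSV options.
spark. \
    read. \
    format('csv'). \
    option('header', 'true'). \
    option('inferSchema', 'true'). \
    load('/public/hypothetical/data'). \
    show()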

Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

# Create a Spark session on YARN with Hive support and a user-specific warehouse directory.
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL via one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
spark
spark.read
spark.read.csv?
help(spark.read.csv)
spark. \
    read. \
    csv('/public/retail_db/orders',
        schema='''
            order_id INT, 
            order_date STRING, 
            order_customer_id INT, 
            order_status STRING
        '''
       ). \
    show()
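As noted earlier, the delimiter can be changed from the default comma using the sep argument of spark.read.csv. A minimal sketch, assuming a hypothetical tab-delimited copy of the orders dataset at /public/retail_db_tab/orders:

# sep overrides the default comma delimiter; the path below is hypothetical.
spark. \
    read. \
    csv('/public/retail_db_tab/orders',
        sep='\t',
        schema='''
            order_id INT,
            order_date STRING,
            order_customer_id INT,
            order_status STRING
        '''
       ). \
    show()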
  • Reading JSON data from text files. We can infer the schema from the data as each JSON object contains both column names and values (we verify the inferred schema after the demo below).

  • Example of a JSON record

{
    "order_id": 1, 
    "order_date": "2013-07-25 00:00:00.0", 
    "order_customer_id": 12345, 
    "order_status": "COMPLETE"
}
spark.read.json?
spark. \
    read. \
    json('/public/retail_db_json/orders'). \
    show()
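Because the column names and data types are inferred from the JSON attributes, we can confirm what Spark inferred using printSchema. This reads the same dataset as above; printSchema is a standard DataFrame method.

# Verify the schema Spark inferred from the JSON attributes.
spark. \
    read. \
    json('/public/retail_db_json/orders'). \
    printSchema()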