Overview of Spark read APIs
Let us get an overview of the Spark read APIs to read files of different formats.

spark has a bunch of APIs to read data from files of different formats. All of these APIs are exposed under spark.read.

text - to read single column data from text files, as well as reading each entire file as one record.
csv - to read text files with delimiters. The default delimiter is a comma, but we can use other delimiters as well.
json - to read data from JSON files.
orc - to read data from ORC files.
parquet - to read data from Parquet files.

We can also read data from other file formats by plugging in third party libraries and using spark.read.format.
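As a quick sketch, the format-based API is equivalent to the shortcut APIs listed above. The following reads the same orders data set that is used later in this notebook:

# Equivalent to spark.read.csv('/public/retail_db/orders')
spark. \
    read. \
    format('csv'). \
    load('/public/retail_db/orders'). \
    show()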
We can also pass options based on the file format (see the sketch after this list).

inferSchema - to infer the data types of the columns based on the data.
header - to use the header row to get the column names in the case of delimited text files.
schema - to explicitly specify the schema.
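Here is a minimal sketch of passing the header and inferSchema options while reading delimited data. The path is a hypothetical example and assumes the file carries a header row (the orders data set used later in this notebook does not).

# Hypothetical path; assumes the file has a header row with column names.
spark. \
    read. \
    option('header', 'true'). \
    option('inferSchema', 'true'). \
    csv('/public/sample_data/orders_with_header'). \
    show()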
We can get help on APIs such as spark.read.csv using help(spark.read.csv).

Reading delimited data from text files.
Let us start the Spark context for this Notebook so that we can execute the code provided.
from pyspark.sql import SparkSession
import getpass

# Use the OS user name to keep the app name and warehouse directory per user.
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()
If you are going to use CLIs, you can use Spark SQL through one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
spark
spark.read
spark.read.csv?
help(spark.read.csv)
spark. \
read. \
csv('/public/retail_db/orders',
schema='''
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
'''
). \
show()
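As the default delimiter is a comma, reading data with a different delimiter only requires the sep option. A minimal sketch, assuming a hypothetical pipe delimited copy of the orders data:

# Hypothetical path; assumes pipe delimited data with the orders layout.
spark. \
    read. \
    csv('/public/sample_data/orders_pipe',
        sep='|',
        schema='''
            order_id INT,
            order_date STRING,
            order_customer_id INT,
            order_status STRING
        '''
    ). \
    show()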
Reading JSON data from text files.

We can infer the schema from the data, as each JSON object contains both the column names and the values.

Example JSON record:
{
"order_id": 1,
"order_date": "2013-07-25 00:00:00.0",
"order_customer_id": 12345,
"order_status": "COMPLETE"
}
spark.read.json?
spark. \
read. \
json('/public/retail_db_json/orders'). \
show()
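We can confirm that the schema was indeed inferred from the JSON data by printing it:

# Column names come from the JSON attributes; types are inferred from values.
spark. \
    read. \
    json('/public/retail_db_json/orders'). \
    printSchema()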