Starting Spark Context

Let us start Spark Context using SparkSession.

  • SparkSession is a class in the pyspark.sql package.

  • It is a wrapper on top of Spark Context.

  • When a Spark application is submitted using spark-submit, spark-shell, or pyspark, a Spark Context is started in the driver process, and it serves a web UI for the application.

  • Spark Context maintains the context of all the jobs that are submitted until it is killed.

  • Because SparkSession wraps Spark Context, the underlying context is still reachable through the session's sparkContext attribute.

  • We first need to create a SparkSession object. It can have any name, but by convention we use spark. Once it is created, several APIs are exposed through it, including read.

  • While creating the SparkSession object, we need to at least set the application name and specify the execution mode in which Spark Context should run.

  • We can use appName to specify the name of the application and master to specify the execution mode.

  • Below is a sample code snippet that creates the SparkSession object for us.

Let us start the Spark Context for this notebook so that we can execute the code provided.

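from pyspark.sql import SparkSession
import getpass

# Use the OS user name to keep the warehouse directory and application name unique per user
username = getpass.getuser()

# getOrCreate() returns the existing session if there is one, otherwise it starts a new one
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()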
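
To try the points above without a YARN cluster, here is a minimal sketch. It assumes pyspark is installed and that local mode (local[2]) is an acceptable stand-in for yarn. The helper names session_options and build_session are hypothetical and exist only for illustration. The commented usage shows how the Spark Context wrapped by the session might be inspected and stopped.

```python
import getpass


def session_options(username):
    """Return the builder settings used in this notebook as a plain dict."""
    return {
        "spark.ui.port": "0",  # 0 lets Spark bind the web UI to any free port
        "spark.sql.warehouse.dir": f"/user/{username}/warehouse",
    }


def build_session(app_name, master="local[2]"):
    """Create (or reuse) a SparkSession; local[2] is an assumption for experiments."""
    # Imported here so that session_options can be used without pyspark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in session_options(getpass.getuser()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage, left as comments because it needs a working Spark installation:
#   spark = build_session("demo")
#   sc = spark.sparkContext      # the Spark Context wrapped by the session
#   print(sc.appName, sc.master)
#   spark.stop()                 # stops the underlying Spark Context as well
```

Keeping the settings in a dict makes it easy to see which options are passed to the builder, and the same values can be reused when launching from the command line.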

If you are going to use the CLIs, you can work with Spark SQL using one of the following 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
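
The three commands above differ only in the launcher name. As an illustration, the following Python sketch assembles the same command line for any of them. The function name launcher_command and the user name alice are hypothetical, shlex.join requires Python 3.8 or later, and the sketch only builds the string; it does not launch anything.

```python
import getpass
import shlex


def launcher_command(launcher, username=None):
    """Build the command line shown above for spark2-sql, spark2-shell or pyspark2."""
    username = username or getpass.getuser()
    args = [
        launcher,
        "--master", "yarn",
        "--conf", "spark.ui.port=0",
        "--conf", f"spark.sql.warehouse.dir=/user/{username}/warehouse",
    ]
    # shlex.join quotes each argument so the result can be pasted into a shell
    return shlex.join(args)


# "alice" is a made-up user name used only to show the resulting strings
for name in ("spark2-sql", "spark2-shell", "pyspark2"):
    print(launcher_command(name, username="alice"))
```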