Starting Spark Context
Let us start the Spark Context using SparkSession.

SparkSession is a class that is part of the pyspark.sql package. It is a wrapper on top of Spark Context.

When a Spark application is submitted using spark-submit, spark-shell, or pyspark, a Spark Context is started, along with a web service known as the Spark UI. The Spark Context maintains the context of all the jobs that are submitted until it is killed.

SparkSession is nothing but a wrapper on top of Spark Context. We need to first create a SparkSession object; it can be given any name, but typically we use spark. Once it is created, several APIs are exposed, including read.

While creating the SparkSession object, we need to at least set the application name and specify the execution mode in which the Spark Context should run. We can use appName to specify the name of the application and master to specify the execution mode.

Below is a sample code snippet which will start the SparkSession object for us.
Let us start the Spark Context for this notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.
from pyspark.sql import SparkSession
import getpass

# Capture the OS username to build user-specific paths and application names
username = getpass.getuser()
username
# Build the session: run on YARN with Hive support, a per-user warehouse
# directory, and spark.ui.port set to '0' so Spark picks any free port for the UI
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()
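Once the session is created, we can use the APIs it exposes, such as read. Below is a minimal sketch of reading a CSV file into a DataFrame; the path is hypothetical, so replace it with a file that actually exists in your environment.

# Hypothetical path; substitute a real file in HDFS or your local file system
df = spark.read.csv(f'/user/{username}/data/sample.csv', header=True, inferSchema=True)
df.printSchema()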
If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using PySpark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
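Once any of these shells comes up, you can validate the session with a quick query. Below is a minimal sketch assuming the PySpark shell, where the SparkSession is pre-created and exposed as spark.

# Run a trivial query to confirm the session is up and SQL is working
spark.sql('SELECT current_date').show()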
spark
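Evaluating spark in a notebook cell displays the details of the active session. If you do not have access to a YARN cluster, a similar session can be created in local mode; below is a minimal sketch assuming a plain local PySpark installation without Hive.

from pyspark.sql import SparkSession

# Local mode: use all cores on this machine instead of running on YARN
spark = SparkSession. \
    builder. \
    appName('Python - Data Processing - Overview'). \
    master('local[*]'). \
    getOrCreate()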