Getting Started¶
Setup Spark Locally - Ubuntu¶
Let us set up Spark locally on Ubuntu.
Install the latest version of Anaconda.
Make sure Jupyter Notebook is set up and validated.
Set up Spark and validate.
Set up environment variables to integrate Pyspark with Jupyter Notebook.
Launch Jupyter Notebook using the pyspark command (see the validation sketch after this list).
Set up PyCharm (IDE) for application development.
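Here is a minimal validation sketch, assuming a local Spark installation. If the notebook is launched using the pyspark command, the spark object already exists and getOrCreate() simply reuses it; with a plain Python kernel, a local session can be created explicitly as shown.
# Minimal validation sketch, assuming Spark is installed locally.
# If the notebook was launched via pyspark, spark is already available
# and getOrCreate() will reuse that session.
from pyspark.sql import SparkSession

spark = SparkSession. \
    builder. \
    master('local'). \
    appName('Validate Spark Setup'). \
    getOrCreate()

spark.range(10).show()    # should display the numbers 0 through 9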
Setup Spark Locally - Mac¶
Let us set up Spark locally on Mac.
Install the latest version of Anaconda.
Make sure Jupyter Notebook is set up and validated.
Set up Spark and validate.
Set up environment variables to integrate Pyspark with Jupyter Notebook.
Launch Jupyter Notebook using the pyspark command.
Set up PyCharm (IDE) for application development.
Signing up for ITVersity Labs¶
Here are the steps for signing up for ITVersity Labs.
Go to https://labs.itversity.com
Sign up on our website
Purchase lab access
Go to the lab page and create a lab account
Log in and practice
Using ITVersity Labs¶
Let us understand how to submit Spark jobs in ITVersity Labs.
You can either use the Jupyter based environment or pyspark in the terminal to submit jobs in ITVersity Labs.
You can also submit Spark jobs using the spark-submit command.
As we are using Python, we can also use the help command to get the documentation - for example, help(spark.read.csv).
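To connect the help output to actual usage, here is a minimal sketch of spark.read.csv; the input path is only a hypothetical placeholder.
# Minimal sketch of spark.read.csv, assuming an active SparkSession named spark.
# The path below is a hypothetical placeholder - point it to a dataset you have access to.
orders = spark.read.csv(
    '/public/retail_db/orders',    # hypothetical input path
    inferSchema=True               # let Spark infer the column types
)
orders.printSchema()
orders.show(10)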
Interacting with File Systems¶
Let us understand how to interact with the file system using the %fs magic command from a Databricks notebook.
We can access datasets using the %fs magic command in a Databricks notebook.
By default, we will see files under dbfs.
We can list the files using the ls command - e.g.:
%fs ls
Databricks provides a lot of datasets for free under databricks-datasets.
If the cluster is integrated with AWS or Azure Blob Storage, we can access files by specifying the appropriate protocol (e.g., s3:// for S3).
Here is the list of commands available under %fs:
Copying files or directories - cp
Moving files or directories - mv
Creating directories - mkdirs
Deleting files and directories - rm
We can copy or delete directories recursively using -r or --recursive (see the sketch after this list).
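The same operations are also available programmatically via dbutils.fs in a Databricks notebook. Here is a minimal sketch; the dbfs:/tmp/demo location is a hypothetical placeholder, and the README.md under databricks-datasets is only assumed to exist.
# Minimal sketch of the dbutils.fs equivalents of the %fs commands above.
# dbfs:/tmp/demo is a hypothetical placeholder path.
dbutils.fs.mkdirs('dbfs:/tmp/demo')                                                # create a directory
dbutils.fs.cp('dbfs:/databricks-datasets/README.md', 'dbfs:/tmp/demo/README.md')   # copy a file (assumed source)
dbutils.fs.mv('dbfs:/tmp/demo/README.md', 'dbfs:/tmp/demo/README_renamed.md')      # move or rename a file
display(dbutils.fs.ls('dbfs:/tmp/demo'))                                           # list the directory
dbutils.fs.rm('dbfs:/tmp/demo', recurse=True)                                      # delete the directory recursively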
Getting File Metadata¶
Let us review the source location to get the number of files and the size of the data we are going to process.
Location of airlines data: dbfs:/databricks-datasets/airlines
We can get the first 1000 files using %fs ls dbfs:/databricks-datasets/airlines.
The location contains 1919 files; however, we will not be able to see all the details using the %fs command.
Databricks File System commands do not have the capability to get detailed metadata of files, such as size.
When the Spark cluster is started, it will create 2 objects - spark and sc.
sc is of type SparkContext and spark is of type SparkSession.
Spark uses HDFS APIs to interact with the file system, and we can access the HDFS APIs using sc._jsc and sc._jvm to get file metadata.
Here are the steps to get the file metadata.
Get the Hadoop configuration using sc._jsc.hadoopConfiguration() - let's say conf.
We can pass conf to sc._jvm.org.apache.hadoop.fs.FileSystem.get to get the FileSystem object - let's say fs.
We can build the path object by passing the path as a string to sc._jvm.org.apache.hadoop.fs.Path.
We can invoke listStatus on top of fs by passing path, which will return an array of FileStatus objects - let's say files.
Each FileStatus object has all the metadata of each file.
We can use len on files to get the number of files.
We can use getLen() on each FileStatus object to get the size of each file.
The cumulative size of all files can be computed using sum(map(lambda file: file.getLen(), files)).
Let us first get the list of files.
%fs ls dbfs:/databricks-datasets/airlines
Here is the consolidated script to get the number of files and the cumulative size of all files in a given folder.
conf = sc._jsc.hadoopConfiguration()                                  # Hadoop configuration from the JVM
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)                # FileSystem handle
path = sc._jvm.org.apache.hadoop.fs.Path("dbfs:/databricks-datasets/airlines")
files = fs.listStatus(path)                                           # array of FileStatus objects
len(files)                                                            # number of files
sum(map(lambda file: file.getLen(), files)) / 1024 / 1024 / 1024      # cumulative size in GB
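If we want to look at individual files rather than just the totals, we can iterate over the same FileStatus objects. This is a minimal sketch that assumes files has already been populated by the script above and only prints the first few entries.
# Minimal sketch building on the script above (assumes files is already populated).
# getPath() and getLen() are standard Hadoop FileStatus methods.
for i in range(min(5, len(files))):                    # inspect only the first 5 entries
    file = files[i]
    print(file.getPath().getName(), file.getLen())     # file name and size in bytes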