Getting Started

Setup Spark Locally - Ubuntu

Let us set up Spark locally on Ubuntu.

  • Install the latest version of Anaconda.

  • Make sure Jupyter Notebook is set up and validated.

  • Set up Spark and validate it.

  • Set up environment variables to integrate Pyspark with Jupyter Notebook (a sketch of typical variables follows this list).

  • Launch Jupyter Notebook using the pyspark command.

  • Set up PyCharm (IDE) for application development.
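
For reference, here is a minimal sketch of the environment variables that are commonly used so that the pyspark command launches Jupyter Notebook. The Spark installation path below is a placeholder; adjust it to wherever Spark was extracted.

# Placeholder path - point SPARK_HOME to your actual Spark installation
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

# Make the pyspark command start Jupyter Notebook instead of the default shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'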

Setup Spark Locally - Mac

Let us set up Spark locally on Mac.

  • Install the latest version of Anaconda.

  • Make sure Jupyter Notebook is set up and validated.

  • Set up Spark and validate it (a quick validation snippet follows this list).

  • Set up environment variables to integrate Pyspark with Jupyter Notebook.

  • Launch Jupyter Notebook using the pyspark command.

  • Set up PyCharm (IDE) for application development.
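
Once the environment variables are in place and Jupyter Notebook has been launched using the pyspark command, a quick way to validate the setup is to run a trivial job from a notebook cell. The snippet below is only a sanity check and assumes the spark and sc objects created by pyspark are available.

# spark (SparkSession) and sc (SparkContext) are created by the pyspark command
print(sc.version)

# Run a small job end to end to confirm Spark works
spark.range(10).show()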

Signing up for ITVersity Labs

Here are the steps for signing up for ITVersity Labs.

  • Go to https://labs.itversity.com

  • Sign up on our website.

  • Purchase lab access.

  • Go to the lab page and create a lab account.

  • Log in and practice.

Using ITVersity Labs

Let us understand how to submit Spark jobs in ITVersity Labs.

  • You can use either the Jupyter-based environment or pyspark in the terminal to submit jobs in ITVersity Labs.

  • You can also submit Spark jobs using the spark-submit command.

  • As we are using Python, we can also use the help function to get the documentation - for example, help(spark.read.csv), as shown below.
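
For example, from a pyspark session or a Jupyter notebook in the labs, the built-in Python help function displays the documentation of any Pyspark API. The snippet assumes the spark object is already available, as it is in both environments.

# Display the documentation of the CSV reader
help(spark.read.csv)

# Any other object or method can be inspected the same way
help(spark.read)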

Interacting with File Systems

Let us understand how to interact with the file system using the %fs command from a Databricks notebook.

  • We can access datasets using the %fs magic command in a Databricks notebook.

  • By default, we will see files under dbfs.

  • We can list the files using the ls command - e.g.: %fs ls

  • Databricks provides a lot of datasets for free under databricks-datasets.

  • If the cluster is integrated with AWS or Azure Blob Storage, we can access files by specifying the appropriate protocol (e.g.: s3:// for S3).

  • List of commands available under %fs (a usage sketch follows this list):

    • Copying files or directories - cp

    • Moving files or directories - mv

    • Creating directories - mkdirs

    • Deleting files and directories - rm

    • We can copy or delete directories recursively using -r or --recursive.
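
Since %fs runs one command per cell, the sketch below uses the equivalent dbutils.fs calls from a Python cell instead, which allows comments and multiple operations together. The paths under dbfs:/tmp are hypothetical and are only meant to illustrate the syntax.

# Python equivalents of the %fs sub-commands (available in Databricks notebooks)
dbutils.fs.ls("dbfs:/databricks-datasets")

# Hypothetical paths, purely to illustrate the syntax
dbutils.fs.mkdirs("dbfs:/tmp/fs_demo")
dbutils.fs.cp("dbfs:/databricks-datasets/README.md", "dbfs:/tmp/fs_demo/README.md")
dbutils.fs.rm("dbfs:/tmp/fs_demo", recurse=True)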

Getting File Metadata

Let us review the source location to get the number of files and the size of the data we are going to process.

  • Location of airlines data - dbfs:/databricks-datasets/airlines

  • We can get the first 1000 files using %fs ls dbfs:/databricks-datasets/airlines

  • The location contains 1919 files; however, we will not be able to see all the details using the %fs command.

  • Databricks File System commands do not have the capability to show detailed file metadata, such as size.

  • When a Spark cluster is started, it creates 2 objects - spark and sc.

  • sc is of type SparkContext and spark is of type SparkSession.

  • Spark uses HDFS APIs to interact with the file system, and we can access those HDFS APIs using sc._jsc and sc._jvm to get file metadata.

  • Here are the steps to get the file metadata.

    • Get the Hadoop configuration using sc._jsc.hadoopConfiguration() - let's say conf.

    • We can pass conf to sc._jvm.org.apache.hadoop.fs.FileSystem.get to get a FileSystem object - let's say fs.

    • We can build a Path object by passing the path as a string to sc._jvm.org.apache.hadoop.fs.Path.

    • We can invoke listStatus on fs by passing the path, which will return an array of FileStatus objects - let's say files.

    • Each FileStatus object has all the metadata of the corresponding file.

    • We can use len on files to get the number of files.

    • We can use getLen on each FileStatus object to get the size of each file.

    • The cumulative size of all files can be computed using sum(map(lambda file: file.getLen(), files)).

Let us first get the list of files.

%fs ls dbfs:/databricks-datasets/airlines

Here is the consolidated script to get the number of files and the cumulative size of all the files in a given folder.

conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)
path = sc._jvm.org.apache.hadoop.fs.Path("dbfs:/databricks-datasets/airlines")

# files is an array of FileStatus objects, one per file in the folder
files = fs.listStatus(path)
print(len(files))

# Cumulative size of all the files, converted from bytes to GB
sum(map(lambda file: file.getLen(), files)) / 1024 / 1024 / 1024
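
Once files is available from the script above, the metadata of an individual file can also be inspected, for example the name and size of the first file in the folder.

# Name and size (in bytes) of the first file returned by listStatus
files[0].getPath().getName()
files[0].getLen()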