Overview of Functions

Let us get an overview of different functions that are available to process data in columns.

  • While Data Frame APIs work on the Data Frame as a whole, at times we might want to apply functions on individual column values.

  • Functions to process column values are available under pyspark.sql.functions. These are typically used within select or withColumn on top of a Data Frame. A short sketch after the list below illustrates this usage.

  • There are approximately 300 pre-defined functions available for us.

  • Some of the important functions can be broadly categorized into String Manipulation, Date Manipulation, Numeric Functions and Aggregate Functions.

  • String Manipulation Functions

    • Concatenating Strings - concat

    • Getting Length - length

    • Trimming Strings - trim, rtrim, ltrim

    • Padding Strings - lpad, rpad

    • Extracting Strings - split, substring

  • Date Manipulation Functions

    • Date Arithmetic - date_add, date_sub, datediff, add_months

    • Date Extraction - dayofmonth, month, year

    • Get beginning period - trunc, date_trunc

  • Numeric Functions - abs, greatest

  • Aggregate Functions - sum, min, max
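
As a quick preview of how such functions are typically invoked, here is a minimal sketch that applies one function from a few of the categories above inside select. It assumes the spark session created later in this Notebook is already available; the DataFrame df and its name column are purely illustrative.

from pyspark.sql.functions import col, upper, length, current_date, date_add

# Illustrative DataFrame with a single string column
df = spark.createDataFrame([("Scott",), ("Henry",)], schema="name STRING")

df.select(
    upper(col("name")).alias("name_upper"),            # String Manipulation
    length("name").alias("name_length"),                # String Manipulation
    date_add(current_date(), 7).alias("one_week_out")   # Date Arithmetic
).show()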

Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

# Build (or reuse) a SparkSession on YARN with Hive support,
# using a per-user warehouse directory
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()
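
Once the session is up, a quick sanity check such as the following (an illustrative one-liner, not part of the original flow) confirms that queries can be executed.

spark.sql('SELECT current_date()').show()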

If you are going to use CLIs, you can use Spark SQL with one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
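
Whichever shell is launched, the YARN and warehouse settings above are applied at startup. For instance, inside the pyspark2 shell the spark session is predefined, so a quick check such as the following (purely illustrative) should work.

spark.sql('SHOW databases').show()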
Let us create a Data Frame with sample employee data, which we will use while exploring these functions.

employees = [(1, "Scott", "Tiger", 1000.0, "united states"),
             (2, "Henry", "Ford", 1250.0, "India"),
             (3, "Nick", "Junior", 750.0, "united KINGDOM"),
             (4, "Bill", "Gomes", 1500.0, "AUSTRALIA")
            ]
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING"""
                   )
employeesDF.printSchema()
employeesDF.show()
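
Before getting into the task, here is a quick illustration (not part of the original task list) of one function each from the String Manipulation and Aggregate categories applied to this Data Frame.

from pyspark.sql.functions import lpad, sum as sum_

# Pad first_name to 10 characters (String Manipulation)
employeesDF.select("first_name", lpad("first_name", 10, "-").alias("first_name_padded")).show()

# Total up salary across all employees (Aggregate)
employeesDF.select(sum_("salary").alias("total_salary")).show()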

Tasks

Let us perform a task to understand how functions are typically used.

  • Project a full name by concatenating first name and last name, along with the other fields, excluding the original first name and last name columns.

from pyspark.sql.functions import lit, concat

# This fails with an AnalysisException: ", " is interpreted as a column name,
# not as a string literal
employeesDF. \
    withColumn("full_name", concat("first_name", ", ", "last_name")). \
    drop("first_name", "last_name"). \
    show()

# Wrapping the separator in lit turns it into a literal Column,
# so concat works as expected
employeesDF. \
    withColumn("full_name", concat("first_name", lit(", "), "last_name")). \
    drop("first_name", "last_name"). \
    show()
# Without an alias, the derived column gets an auto-generated name
# based on the expression
employeesDF. \
    select("employee_id",
           concat("first_name", lit(", "), "last_name"),
           "salary",
           "nationality"
          ). \
    show()

# alias gives the derived column a readable name
employeesDF. \
    select("employee_id",
           concat("first_name", lit(", "), "last_name").alias("full_name"),
           "salary",
           "nationality"
          ). \
    show()
# selectExpr accepts SQL-style expressions as strings,
# so the alias can be specified with AS
employeesDF. \
    selectExpr("employee_id",
               "concat(first_name, ', ', last_name) AS full_name",
               "salary",
               "nationality"
              ). \
    show()
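
The same expression can also be run as plain Spark SQL by registering the Data Frame as a temporary view; a minimal sketch (the view name employees is illustrative):

employeesDF.createOrReplaceTempView("employees")

spark.sql("""
    SELECT employee_id,
           concat(first_name, ', ', last_name) AS full_name,
           salary,
           nationality
    FROM employees
""").show()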

We will explore most of these functions as we get into data processing at a later point in time.