Using date_format Function

Let us understand how to extract information from dates or times using date_format function.

  • We can use date_format to extract the required information in a desired format from standard date or timestamp. Earlier we have explored to_date and to_timestamp to convert non standard date or timestamp to standard ones respectively.

  • There are also specific functions to extract year, month, day with in a week, a day with in a month, day with in a year etc. These are covered as part of earlier topics in this section or module.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Tasks

Let us perform few tasks to extract the information we need from date or timestamp.

  • Create a Dataframe by name datetimesDF with columns date and time.

datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
                     ("2016-02-29", "2016-02-29 08:08:08.999"),
                     ("2017-10-31", "2017-12-31 11:59:59.123"),
                     ("2019-11-30", "2019-08-31 00:00:00.000")
                ]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")
datetimesDF.show(truncate=False)
+----------+-----------------------+
|date      |time                   |
+----------+-----------------------+
|2014-02-28|2014-02-28 10:00:00.123|
|2016-02-29|2016-02-29 08:08:08.999|
|2017-10-31|2017-12-31 11:59:59.123|
|2019-11-30|2019-08-31 00:00:00.000|
+----------+-----------------------+
from pyspark.sql.functions import date_format
  • Get the year and month from both date and time columns using yyyyMM format. Also make sure that the data type is converted to integer.

datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM")). \
    withColumn("time_ym", date_format("time", "yyyyMM")). \
    show(truncate=False)

# yyyy
# MM
# dd
# DD
# HH
# hh
# mm
# ss
# SSS
+----------+-----------------------+-------+-------+
|date      |time                   |date_ym|time_ym|
+----------+-----------------------+-------+-------+
|2014-02-28|2014-02-28 10:00:00.123|201402 |201402 |
|2016-02-29|2016-02-29 08:08:08.999|201602 |201602 |
|2017-10-31|2017-12-31 11:59:59.123|201710 |201712 |
|2019-11-30|2019-08-31 00:00:00.000|201911 |201908 |
+----------+-----------------------+-------+-------+
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM")). \
    withColumn("time_ym", date_format("time", "yyyyMM")). \
    printSchema()
root
 |-- date: string (nullable = true)
 |-- time: string (nullable = true)
 |-- date_ym: string (nullable = true)
 |-- time_ym: string (nullable = true)
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM").cast('int')). \
    withColumn("time_ym", date_format("time", "yyyyMM").cast('int')). \
    printSchema()
root
 |-- date: string (nullable = true)
 |-- time: string (nullable = true)
 |-- date_ym: integer (nullable = true)
 |-- time_ym: integer (nullable = true)
datetimesDF. \
    withColumn("date_ym", date_format("date", "yyyyMM").cast('int')). \
    withColumn("time_ym", date_format("time", "yyyyMM").cast('int')). \
    show(truncate=False)
+----------+-----------------------+-------+-------+
|date      |time                   |date_ym|time_ym|
+----------+-----------------------+-------+-------+
|2014-02-28|2014-02-28 10:00:00.123|201402 |201402 |
|2016-02-29|2016-02-29 08:08:08.999|201602 |201602 |
|2017-10-31|2017-12-31 11:59:59.123|201710 |201712 |
|2019-11-30|2019-08-31 00:00:00.000|201911 |201908 |
+----------+-----------------------+-------+-------+
  • Get the information from time in yyyyMMddHHmmss format.

from pyspark.sql.functions import date_format
datetimesDF. \
    withColumn("date_dt", date_format("date", "yyyyMMddHHmmss")). \
    withColumn("date_ts", date_format("time", "yyyyMMddHHmmss")). \
    show(truncate=False)
+----------+-----------------------+--------------+--------------+
|date      |time                   |date_dt       |date_ts       |
+----------+-----------------------+--------------+--------------+
|2014-02-28|2014-02-28 10:00:00.123|20140228000000|20140228100000|
|2016-02-29|2016-02-29 08:08:08.999|20160229000000|20160229080808|
|2017-10-31|2017-12-31 11:59:59.123|20171031000000|20171231115959|
|2019-11-30|2019-08-31 00:00:00.000|20191130000000|20190831000000|
+----------+-----------------------+--------------+--------------+
datetimesDF. \
    withColumn("date_dt", date_format("date", "yyyyMMddHHmmss").cast('long')). \
    withColumn("date_ts", date_format("time", "yyyyMMddHHmmss").cast('long')). \
    show(truncate=False)
+----------+-----------------------+--------------+--------------+
|date      |time                   |date_dt       |date_ts       |
+----------+-----------------------+--------------+--------------+
|2014-02-28|2014-02-28 10:00:00.123|20140228000000|20140228100000|
|2016-02-29|2016-02-29 08:08:08.999|20160229000000|20160229080808|
|2017-10-31|2017-12-31 11:59:59.123|20171031000000|20171231115959|
|2019-11-30|2019-08-31 00:00:00.000|20191130000000|20190831000000|
+----------+-----------------------+--------------+--------------+
  • Get year and day of year using yyyyDDD format.

datetimesDF. \
    withColumn("date_yd", date_format("date", "yyyyDDD").cast('int')). \
    withColumn("time_yd", date_format("time", "yyyyDDD").cast('int')). \
    show(truncate=False)
+----------+-----------------------+-------+-------+
|date      |time                   |date_yd|time_yd|
+----------+-----------------------+-------+-------+
|2014-02-28|2014-02-28 10:00:00.123|2014059|2014059|
|2016-02-29|2016-02-29 08:08:08.999|2016060|2016060|
|2017-10-31|2017-12-31 11:59:59.123|2017304|2017365|
|2019-11-30|2019-08-31 00:00:00.000|2019334|2019243|
+----------+-----------------------+-------+-------+
  • Get complete description of the date.

datetimesDF. \
    withColumn("date_desc", date_format("date", "MMMM d, yyyy")). \
    show(truncate=False)
+----------+-----------------------+-----------------+
|date      |time                   |date_desc        |
+----------+-----------------------+-----------------+
|2014-02-28|2014-02-28 10:00:00.123|February 28, 2014|
|2016-02-29|2016-02-29 08:08:08.999|February 29, 2016|
|2017-10-31|2017-12-31 11:59:59.123|October 31, 2017 |
|2019-11-30|2019-08-31 00:00:00.000|November 30, 2019|
+----------+-----------------------+-----------------+
  • Get name of the week day using date.

datetimesDF. \
    withColumn("day_name_abbr", date_format("date", "EE")). \
    show(truncate=False)
+----------+-----------------------+-------------+
|date      |time                   |day_name_abbr|
+----------+-----------------------+-------------+
|2014-02-28|2014-02-28 10:00:00.123|Fri          |
|2016-02-29|2016-02-29 08:08:08.999|Mon          |
|2017-10-31|2017-12-31 11:59:59.123|Tue          |
|2019-11-30|2019-08-31 00:00:00.000|Sat          |
+----------+-----------------------+-------------+
datetimesDF. \
    withColumn("day_name_full", date_format("date", "EEEE")). \
    show(truncate=False)
+----------+-----------------------+-------------+
|date      |time                   |day_name_full|
+----------+-----------------------+-------------+
|2014-02-28|2014-02-28 10:00:00.123|Friday       |
|2016-02-29|2016-02-29 08:08:08.999|Monday       |
|2017-10-31|2017-12-31 11:59:59.123|Tuesday      |
|2019-11-30|2019-08-31 00:00:00.000|Saturday     |
+----------+-----------------------+-------------+