Using date_format Function¶
Let us understand how to extract information from dates or times using date_format
function.
We can use
date_format
to extract the required information in a desired format from standard date or timestamp. Earlier we have exploredto_date
andto_timestamp
to convert non standard date or timestamp to standard ones respectively.There are also specific functions to extract year, month, day with in a week, a day with in a month, day with in a year etc. These are covered as part of earlier topics in this section or module.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
builder. \
config('spark.ui.port', '0'). \
config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
enableHiveSupport(). \
appName(f'{username} | Python - Processing Column Data'). \
master('yarn'). \
getOrCreate()
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Tasks¶
Let us perform few tasks to extract the information we need from date or timestamp.
Create a Dataframe by name datetimesDF with columns date and time.
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
("2016-02-29", "2016-02-29 08:08:08.999"),
("2017-10-31", "2017-12-31 11:59:59.123"),
("2019-11-30", "2019-08-31 00:00:00.000")
]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")
datetimesDF.show(truncate=False)
+----------+-----------------------+
|date |time |
+----------+-----------------------+
|2014-02-28|2014-02-28 10:00:00.123|
|2016-02-29|2016-02-29 08:08:08.999|
|2017-10-31|2017-12-31 11:59:59.123|
|2019-11-30|2019-08-31 00:00:00.000|
+----------+-----------------------+
from pyspark.sql.functions import date_format
Get the year and month from both date and time columns using
yyyyMM
format. Also make sure that the data type is converted to integer.
datetimesDF. \
withColumn("date_ym", date_format("date", "yyyyMM")). \
withColumn("time_ym", date_format("time", "yyyyMM")). \
show(truncate=False)
# yyyy
# MM
# dd
# DD
# HH
# hh
# mm
# ss
# SSS
+----------+-----------------------+-------+-------+
|date |time |date_ym|time_ym|
+----------+-----------------------+-------+-------+
|2014-02-28|2014-02-28 10:00:00.123|201402 |201402 |
|2016-02-29|2016-02-29 08:08:08.999|201602 |201602 |
|2017-10-31|2017-12-31 11:59:59.123|201710 |201712 |
|2019-11-30|2019-08-31 00:00:00.000|201911 |201908 |
+----------+-----------------------+-------+-------+
datetimesDF. \
withColumn("date_ym", date_format("date", "yyyyMM")). \
withColumn("time_ym", date_format("time", "yyyyMM")). \
printSchema()
root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- date_ym: string (nullable = true)
|-- time_ym: string (nullable = true)
datetimesDF. \
withColumn("date_ym", date_format("date", "yyyyMM").cast('int')). \
withColumn("time_ym", date_format("time", "yyyyMM").cast('int')). \
printSchema()
root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- date_ym: integer (nullable = true)
|-- time_ym: integer (nullable = true)
datetimesDF. \
withColumn("date_ym", date_format("date", "yyyyMM").cast('int')). \
withColumn("time_ym", date_format("time", "yyyyMM").cast('int')). \
show(truncate=False)
+----------+-----------------------+-------+-------+
|date |time |date_ym|time_ym|
+----------+-----------------------+-------+-------+
|2014-02-28|2014-02-28 10:00:00.123|201402 |201402 |
|2016-02-29|2016-02-29 08:08:08.999|201602 |201602 |
|2017-10-31|2017-12-31 11:59:59.123|201710 |201712 |
|2019-11-30|2019-08-31 00:00:00.000|201911 |201908 |
+----------+-----------------------+-------+-------+
Get the information from time in yyyyMMddHHmmss format.
from pyspark.sql.functions import date_format
datetimesDF. \
withColumn("date_dt", date_format("date", "yyyyMMddHHmmss")). \
withColumn("date_ts", date_format("time", "yyyyMMddHHmmss")). \
show(truncate=False)
+----------+-----------------------+--------------+--------------+
|date |time |date_dt |date_ts |
+----------+-----------------------+--------------+--------------+
|2014-02-28|2014-02-28 10:00:00.123|20140228000000|20140228100000|
|2016-02-29|2016-02-29 08:08:08.999|20160229000000|20160229080808|
|2017-10-31|2017-12-31 11:59:59.123|20171031000000|20171231115959|
|2019-11-30|2019-08-31 00:00:00.000|20191130000000|20190831000000|
+----------+-----------------------+--------------+--------------+
datetimesDF. \
withColumn("date_dt", date_format("date", "yyyyMMddHHmmss").cast('long')). \
withColumn("date_ts", date_format("time", "yyyyMMddHHmmss").cast('long')). \
show(truncate=False)
+----------+-----------------------+--------------+--------------+
|date |time |date_dt |date_ts |
+----------+-----------------------+--------------+--------------+
|2014-02-28|2014-02-28 10:00:00.123|20140228000000|20140228100000|
|2016-02-29|2016-02-29 08:08:08.999|20160229000000|20160229080808|
|2017-10-31|2017-12-31 11:59:59.123|20171031000000|20171231115959|
|2019-11-30|2019-08-31 00:00:00.000|20191130000000|20190831000000|
+----------+-----------------------+--------------+--------------+
Get year and day of year using
yyyyDDD
format.
datetimesDF. \
withColumn("date_yd", date_format("date", "yyyyDDD").cast('int')). \
withColumn("time_yd", date_format("time", "yyyyDDD").cast('int')). \
show(truncate=False)
+----------+-----------------------+-------+-------+
|date |time |date_yd|time_yd|
+----------+-----------------------+-------+-------+
|2014-02-28|2014-02-28 10:00:00.123|2014059|2014059|
|2016-02-29|2016-02-29 08:08:08.999|2016060|2016060|
|2017-10-31|2017-12-31 11:59:59.123|2017304|2017365|
|2019-11-30|2019-08-31 00:00:00.000|2019334|2019243|
+----------+-----------------------+-------+-------+
Get complete description of the date.
datetimesDF. \
withColumn("date_desc", date_format("date", "MMMM d, yyyy")). \
show(truncate=False)
+----------+-----------------------+-----------------+
|date |time |date_desc |
+----------+-----------------------+-----------------+
|2014-02-28|2014-02-28 10:00:00.123|February 28, 2014|
|2016-02-29|2016-02-29 08:08:08.999|February 29, 2016|
|2017-10-31|2017-12-31 11:59:59.123|October 31, 2017 |
|2019-11-30|2019-08-31 00:00:00.000|November 30, 2019|
+----------+-----------------------+-----------------+
Get name of the week day using date.
datetimesDF. \
withColumn("day_name_abbr", date_format("date", "EE")). \
show(truncate=False)
+----------+-----------------------+-------------+
|date |time |day_name_abbr|
+----------+-----------------------+-------------+
|2014-02-28|2014-02-28 10:00:00.123|Fri |
|2016-02-29|2016-02-29 08:08:08.999|Mon |
|2017-10-31|2017-12-31 11:59:59.123|Tue |
|2019-11-30|2019-08-31 00:00:00.000|Sat |
+----------+-----------------------+-------------+
datetimesDF. \
withColumn("day_name_full", date_format("date", "EEEE")). \
show(truncate=False)
+----------+-----------------------+-------------+
|date |time |day_name_full|
+----------+-----------------------+-------------+
|2014-02-28|2014-02-28 10:00:00.123|Friday |
|2016-02-29|2016-02-29 08:08:08.999|Monday |
|2017-10-31|2017-12-31 11:59:59.123|Tuesday |
|2019-11-30|2019-08-31 00:00:00.000|Saturday |
+----------+-----------------------+-------------+