Using Date and Time Trunc Functions¶
In Data Warehousing we quite often run to date reports such as week to date, month to date, year to date etc. Let us understand how we can take care of such requirements using appropriate functions over Spark Data Frames.
We can use
trunc
ordate_trunc
for the same to get the beginning date of the week, month, current year etc by passing date or timestamp to it.We can use
trunc
to get beginning date of the month or year by passing date or timestamp to it - for exampletrunc(current_date(), "MM")
will give the first of the current month.We can use
date_trunc
to get beginning date of the month or year as well as beginning time of the day or hour by passing timestamp to it.Get beginning date based on month -
date_trunc("MM", current_timestamp())
Get beginning time based on day -
date_trunc("DAY", current_timestamp())
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
builder. \
config('spark.ui.port', '0'). \
config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
enableHiveSupport(). \
appName(f'{username} | Python - Processing Column Data'). \
master('yarn'). \
getOrCreate()
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
from pyspark.sql.functions import trunc, date_trunc
Tasks¶
Let us perform few tasks to understand trunc and date_trunc in detail.
Create a Dataframe by name datetimesDF with columns date and time.
datetimes = [("2014-02-28", "2014-02-28 10:00:00.123"),
("2016-02-29", "2016-02-29 08:08:08.999"),
("2017-10-31", "2017-12-31 11:59:59.123"),
("2019-11-30", "2019-08-31 00:00:00.000")
]
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")
datetimesDF.show(truncate=False)
+----------+-----------------------+
|date |time |
+----------+-----------------------+
|2014-02-28|2014-02-28 10:00:00.123|
|2016-02-29|2016-02-29 08:08:08.999|
|2017-10-31|2017-12-31 11:59:59.123|
|2019-11-30|2019-08-31 00:00:00.000|
+----------+-----------------------+
Get beginning month date using date field and beginning year date using time field.
from pyspark.sql.functions import trunc
datetimesDF. \
withColumn("date_trunc", trunc("date", "MM")). \
withColumn("time_trunc", trunc("time", "yy")). \
show(truncate=False)
+----------+-----------------------+----------+----------+
|date |time |date_trunc|time_trunc|
+----------+-----------------------+----------+----------+
|2014-02-28|2014-02-28 10:00:00.123|2014-02-01|2014-01-01|
|2016-02-29|2016-02-29 08:08:08.999|2016-02-01|2016-01-01|
|2017-10-31|2017-12-31 11:59:59.123|2017-10-01|2017-01-01|
|2019-11-30|2019-08-31 00:00:00.000|2019-11-01|2019-01-01|
+----------+-----------------------+----------+----------+
Get beginning hour time using date and time field.
from pyspark.sql.functions import date_trunc
datetimesDF. \
withColumn("date_trunc", date_trunc('MM', "date")). \
withColumn("time_trunc", date_trunc('yy', "time")). \
show(truncate=False)
+----------+-----------------------+-------------------+-------------------+
|date |time |date_trunc |time_trunc |
+----------+-----------------------+-------------------+-------------------+
|2014-02-28|2014-02-28 10:00:00.123|2014-02-01 00:00:00|2014-01-01 00:00:00|
|2016-02-29|2016-02-29 08:08:08.999|2016-02-01 00:00:00|2016-01-01 00:00:00|
|2017-10-31|2017-12-31 11:59:59.123|2017-10-01 00:00:00|2017-01-01 00:00:00|
|2019-11-30|2019-08-31 00:00:00.000|2019-11-01 00:00:00|2019-01-01 00:00:00|
+----------+-----------------------+-------------------+-------------------+
datetimesDF. \
withColumn("date_dt", date_trunc("HOUR", "date")). \
withColumn("time_dt", date_trunc("HOUR", "time")). \
withColumn("time_dt1", date_trunc("dd", "time")). \
show(truncate=False)
+----------+-----------------------+-------------------+-------------------+-------------------+
|date |time |date_dt |time_dt |time_dt1 |
+----------+-----------------------+-------------------+-------------------+-------------------+
|2014-02-28|2014-02-28 10:00:00.123|2014-02-28 00:00:00|2014-02-28 10:00:00|2014-02-28 00:00:00|
|2016-02-29|2016-02-29 08:08:08.999|2016-02-29 00:00:00|2016-02-29 08:00:00|2016-02-29 00:00:00|
|2017-10-31|2017-12-31 11:59:59.123|2017-10-31 00:00:00|2017-12-31 11:00:00|2017-12-31 00:00:00|
|2019-11-30|2019-08-31 00:00:00.000|2019-11-30 00:00:00|2019-08-31 00:00:00|2019-08-31 00:00:00|
+----------+-----------------------+-------------------+-------------------+-------------------+