Dealing with Unix Timestamp

Let us understand how to deal with Unix Timestamp in Spark.

  • Unix Timestamp is an integer that counts up from January 1st, 1970 at midnight UTC.

  • That starting point is known as the epoch, and the value is incremented by 1 every second.

  • We can convert a Unix Timestamp to a regular date or timestamp, and vice versa.

  • We can use unix_timestamp to convert a regular date or timestamp to a unix timestamp value. For example, unix_timestamp(lit("2019-11-19 00:00:00")).

  • We can use from_unixtime to convert a unix timestamp to a regular date or timestamp. For example, from_unixtime(lit(1574101800)).

  • We can also pass a format to both functions, as shown in the sketch below.
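
Here is a minimal sketch of both conversions. It assumes the SparkSession created below is already available as spark; the aliases such as unix_ts are our own, and the unix_timestamp results depend on the session time zone.

from pyspark.sql.functions import unix_timestamp, from_unixtime, lit

# One-row Dataframe to demonstrate both directions of the conversion.
spark.range(1). \
    select(unix_timestamp(lit("2019-11-19 00:00:00")).alias("unix_ts"),
           from_unixtime(lit(1574101800)).alias("ts"),
           unix_timestamp(lit("20191119"), "yyyyMMdd").alias("unix_ts_fmt")). \
    show()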

Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can run Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Tasks

Let us perform a few tasks to understand how to deal with Unix Timestamp.

  • Create a Dataframe named datetimesDF with columns dateid, date and time.

datetimes = [(20140228, "2014-02-28", "2014-02-28 10:00:00.123"),
             (20160229, "2016-02-29", "2016-02-29 08:08:08.999"),
             (20171031, "2017-10-31", "2017-12-31 11:59:59.123"),
             (20191130, "2019-11-30", "2019-08-31 00:00:00.000")
            ]
datetimesDF = spark.createDataFrame(datetimes).toDF("dateid", "date", "time")
datetimesDF.show(truncate=False)
+--------+----------+-----------------------+
|dateid  |date      |time                   |
+--------+----------+-----------------------+
|20140228|2014-02-28|2014-02-28 10:00:00.123|
|20160229|2016-02-29|2016-02-29 08:08:08.999|
|20171031|2017-10-31|2017-12-31 11:59:59.123|
|20191130|2019-11-30|2019-08-31 00:00:00.000|
+--------+----------+-----------------------+
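
Note that date and time are plain strings at this point, not date or timestamp types. We can confirm that with the same check used later for unixtimesDF:

datetimesDF.printSchema()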
  • Get the unix timestamp for dateid, date and time.

from pyspark.sql.functions import unix_timestamp, col

# dateid is an integer, so it is cast to string before applying the format.
# time already matches the default format yyyy-MM-dd HH:mm:ss, so no format
# is passed; unix_timestamp has second precision, so the milliseconds in
# time are dropped.
datetimesDF. \
    withColumn("unix_date_id", unix_timestamp(col("dateid").cast("string"), "yyyyMMdd")). \
    withColumn("unix_date", unix_timestamp("date", "yyyy-MM-dd")). \
    withColumn("unix_time", unix_timestamp("time")). \
    show()
+--------+----------+--------------------+------------+----------+----------+
|  dateid|      date|                time|unix_date_id| unix_date| unix_time|
+--------+----------+--------------------+------------+----------+----------+
|20140228|2014-02-28|2014-02-28 10:00:...|  1393563600|1393563600|1393599600|
|20160229|2016-02-29|2016-02-29 08:08:...|  1456722000|1456722000|1456751288|
|20171031|2017-10-31|2017-12-31 11:59:...|  1509422400|1509422400|1514739599|
|20191130|2019-11-30|2019-08-31 00:00:...|  1575090000|1575090000|1567224000|
+--------+----------+--------------------+------------+----------+----------+
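
The same conversions can be cross-checked in Spark SQL on literal values. This is a sketch; the aliases are our own, and the results depend on the session time zone.

spark.sql("""
    SELECT unix_timestamp('20140228', 'yyyyMMdd') AS unix_date_id,
           unix_timestamp('2014-02-28', 'yyyy-MM-dd') AS unix_date,
           unix_timestamp('2014-02-28 10:00:00') AS unix_time
""").show()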
  • Create a Dataframe named unixtimesDF with one column unixtime using 4 values. You can use the unix timestamps generated for the time column in the previous task. Note that unix_timestamp depends on the session time zone, so the values you get may differ from the ones shown here.

unixtimes = [(1393561800, ),
             (1456713488, ),
             (1514701799, ),
             (1567189800, )
            ]
unixtimesDF = spark.createDataFrame(unixtimes).toDF("unixtime")
unixtimesDF.show()
+----------+
|  unixtime|
+----------+
|1393561800|
|1456713488|
|1514701799|
|1567189800|
+----------+
unixtimesDF.printSchema()
root
 |-- unixtime: long (nullable = true)
  • Get the date in yyyyMMdd format and also the complete timestamp.

from pyspark.sql.functions import from_unixtime

# from_unixtime returns strings rendered in the session time zone.
unixtimesDF. \
    withColumn("date", from_unixtime("unixtime", "yyyyMMdd")). \
    withColumn("time", from_unixtime("unixtime")). \
    show()
+----------+--------+-------------------+
|  unixtime|    date|               time|
+----------+--------+-------------------+
|1393561800|20140227|2014-02-27 23:30:00|
|1456713488|20160228|2016-02-28 21:38:08|
|1514701799|20171231|2017-12-31 01:29:59|
|1567189800|20190830|2019-08-30 14:30:00|
+----------+--------+-------------------+
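
Since from_unixtime returns strings, a cast can be layered on top when typed columns are needed. A minimal sketch, assuming the default session time zone:

from pyspark.sql.functions import from_unixtime

# Cast the formatted strings to proper date and timestamp types.
unixtimesDF. \
    withColumn("date", from_unixtime("unixtime", "yyyy-MM-dd").cast("date")). \
    withColumn("time", from_unixtime("unixtime").cast("timestamp")). \
    printSchema()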