Padding Characters around Strings

Let us go through how to pad characters to strings using Spark Functions.

  • We typically pad characters to build fixed length values or records.

  • Fixed length values or records are extensively used in Mainframes based systems.

  • The length of each field in a fixed length record is predetermined; if a field's value is shorter than the predetermined length, we pad it with a standard character.

  • For numeric fields we pad with zeros on the leading (left) side. For non-numeric fields, we pad with a standard character on the leading or trailing side.

  • We use lpad to pad a string with a specific character on the leading (left) side and rpad to pad on the trailing (right) side.

  • Both lpad and rpad take 3 arguments - the column or expression, the desired length, and the character to pad with (see the sketch after this list).
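
Here is a minimal sketch of how the two functions are called. It assumes a SparkSession named spark is already available (one is created in the next cell); the values shown in the comments are what these calls would produce.

from pyspark.sql.functions import lit, lpad, rpad

# Illustrative sketch - assumes a running SparkSession named spark.
spark.range(1). \
    select(
        lpad(lit("7"), 5, "0").alias("lpad_example"),   # 00007
        rpad(lit("Sam"), 5, "-").alias("rpad_example")   # Sam--
    ). \
    show()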

Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can run Spark SQL using one of the following 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Tasks - Padding Strings

Let us perform simple tasks to understand the syntax of lpad and rpad.

  • Create a Dataframe with a single value and a single column.

  • Apply lpad to pad Hello with - so that it is 10 characters long.

l = [('X',)]
df = spark.createDataFrame(l).toDF("dummy")
from pyspark.sql.functions import lit, lpad
df.select(lpad(lit("Hello"), 10, "-").alias("dummy")).show()
+----------+
|     dummy|
+----------+
|-----Hello|
+----------+
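
For comparison, here is a similar sketch using rpad on the same literal. Note that both lpad and rpad truncate the value when it is longer than the desired length; the expected output is shown in the comments.

from pyspark.sql.functions import lit, rpad

# Pad Hello with - on the trailing side to make it 10 characters.
df.select(rpad(lit("Hello"), 10, "-").alias("dummy")).show()
# |Hello-----|

# If the value is longer than the desired length, it is truncated.
df.select(rpad(lit("Hello World"), 5, "-").alias("dummy")).show()
# |Hello|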
  • Let’s create the employees Dataframe

employees = [(1, "Scott", "Tiger", 1000.0,
              "united states", "+1 123 456 7890", "123 45 6789"
             ),
             (2, "Henry", "Ford", 1250.0,
              "India", "+91 234 567 8901", "456 78 9123"
             ),
             (3, "Nick", "Junior", 750.0,
              "united KINGDOM", "+44 111 111 1111", "222 33 4444"
             ),
             (4, "Bill", "Gomes", 1500.0,
              "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
             )
            ]
employeesDF = spark.createDataFrame(employees). \
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )
employeesDF.show()
+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+
employeesDF.printSchema()
root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- nationality: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- ssn: string (nullable = true)
  • Use the pad functions to convert each of the fields to a fixed length and then concatenate them. Here are the details for each of the fields.

    • Length of the employee_id should be 5 characters and should be padded with zero.

    • Length of first_name and last_name should be 10 characters and should be padded with - on the right side.

    • Length of salary should be 10 characters and should be padded with zero.

    • Length of the nationality should be 15 characters and should be padded with - on the right side.

    • Length of the phone_number should be 17 characters and should be padded with - on the right side.

    • Length of the ssn can be left as is. It is 11 characters.

  • Create a new Dataframe empFixedDF with the column name employee. Preview the data by disabling truncation.

from pyspark.sql.functions import lpad, rpad, concat
empFixedDF = employeesDF.select(
    concat(
        lpad("employee_id", 5, "0"), 
        rpad("first_name", 10, "-"), 
        rpad("last_name", 10, "-"),
        lpad("salary", 10, "0"), 
        rpad("nationality", 15, "-"), 
        rpad("phone_number", 17, "-"), 
        "ssn"
    ).alias("employee")
)
empFixedDF.show(truncate=False)
+------------------------------------------------------------------------------+
|employee                                                                      |
+------------------------------------------------------------------------------+
|00001Scott-----Tiger-----00001000.0united states--+1 123 456 7890--123 45 6789|
|00002Henry-----Ford------00001250.0India----------+91 234 567 8901-456 78 9123|
|00003Nick------Junior----00000750.0united KINGDOM-+44 111 111 1111-222 33 4444|
|00004Bill------Gomes-----00001500.0AUSTRALIA------+61 987 654 3210-789 12 6118|
+------------------------------------------------------------------------------+
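
Optionally, we can validate that every record has the expected fixed width of 78 characters (5 + 10 + 10 + 10 + 15 + 17 + 11) using the length function; a minimal sketch:

from pyspark.sql.functions import length

# All records should share the same width: 5 + 10 + 10 + 10 + 15 + 17 + 11 = 78
empFixedDF.select(length("employee").alias("record_length")).distinct().show()
# Expected: a single row with record_length = 78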