Extracting Strings using split

Let us understand how to extract substrings from main string using split function.

  • If we are processing variable length columns with delimiter then we use split to extract the information.

  • Here are some of the examples for variable length columns and the use cases for which we typically extract information.

  • Address where we store House Number, Street Name, City, State and Zip Code comma separated. We might want to extract City and State for demographics reports.

  • split takes 2 arguments, column and delimiter.

  • split convert each string into array and we can access the elements using index.

  • We can also use explode in conjunction with split to explode the list or array into records in Data Frame. It can be used in cases such as word count, phone count etc.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
l = [('X', )]
df = spark.createDataFrame(l, "dummy STRING")
from pyspark.sql.functions import split, explode, lit
df.select(split(lit("Hello World, how are you"), " ")). \
    show(truncate=False)
+----------------------------------+
|split(Hello World, how are you,  )|
+----------------------------------+
|[Hello, World,, how, are, you]    |
+----------------------------------+
df.select(split(lit("Hello World, how are you"), " ")[2]). \
    show(truncate=False)
+-------------------------------------+
|split(Hello World, how are you,  )[2]|
+-------------------------------------+
|how                                  |
+-------------------------------------+
df.select(explode(split(lit("Hello World, how are you"), " ")).alias('word')). \
    show(truncate=False)
+------+
|word  |
+------+
|Hello |
|World,|
|how   |
|are   |
|you   |
+------+
  • Most of the problems can be solved either by using substring or split.

Tasks - split

Let us perform few tasks to extract information from fixed length strings as well as delimited variable length strings.

  • Create a list for employees with name, ssn and phone_numbers.

  • SSN Format 3 2 4 - Fixed Length with 11 characters

  • Phone Number Format - Country Code is variable and remaining phone number have 10 digits. One can have multiple phone numbers where they are separated by ,:

  • Country Code - one to 3 digits

  • Area Code - 3 digits

  • Phone Number Prefix - 3 digits

  • Phone Number Remaining - 4 digits

  • All the 4 parts are separated by spaces

  • Create a Dataframe with column names name, ssn and phone_number

  • Extract area code and last 4 digits from the phone number.

  • Extract last 4 digits from SSN.

employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890,+1 234 567 8901", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111,+44 222 222 2222", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210,+61 876 543 2109", "789 12 6118"
                     )
                ]
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_numbers STRING, ssn STRING"""
                   )
employeesDF. \
    select('employee_id', 'phone_numbers'). \
    show(truncate=False)
+-----------+---------------------------------+
|employee_id|phone_numbers                    |
+-----------+---------------------------------+
|1          |+1 123 456 7890,+1 234 567 8901  |
|2          |+91 234 567 8901                 |
|3          |+44 111 111 1111,+44 222 222 2222|
|4          |+61 987 654 3210,+61 876 543 2109|
+-----------+---------------------------------+
from pyspark.sql.functions import split, explode
employeesDF = employeesDF. \
    select('employee_id', 'phone_numbers', 'ssn'). \
    withColumn('phone_number', explode(split('phone_numbers', ',')))
employeesDF.show(truncate=False)
+-----------+---------------------------------+-----------+----------------+
|employee_id|phone_numbers                    |ssn        |phone_number    |
+-----------+---------------------------------+-----------+----------------+
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 123 456 7890 |
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 234 567 8901 |
|2          |+91 234 567 8901                 |456 78 9123|+91 234 567 8901|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 111 111 1111|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 222 222 2222|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 987 654 3210|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 876 543 2109|
+-----------+---------------------------------+-----------+----------------+
employeesDF. \
    select("employee_id", "phone_number", "ssn"). \
    withColumn("area_code", split("phone_number", " ")[1].cast("int")). \
    withColumn("phone_last4", split("phone_number", " ")[3].cast("int")). \
    withColumn("ssn_last4", split("ssn", " ")[2].cast("int")). \
    show()
+-----------+----------------+-----------+---------+-----------+---------+
|employee_id|    phone_number|        ssn|area_code|phone_last4|ssn_last4|
+-----------+----------------+-----------+---------+-----------+---------+
|          1| +1 123 456 7890|123 45 6789|      123|       7890|     6789|
|          1| +1 234 567 8901|123 45 6789|      234|       8901|     6789|
|          2|+91 234 567 8901|456 78 9123|      234|       8901|     9123|
|          3|+44 111 111 1111|222 33 4444|      111|       1111|     4444|
|          3|+44 222 222 2222|222 33 4444|      222|       2222|     4444|
|          4|+61 987 654 3210|789 12 6118|      987|       3210|     6118|
|          4|+61 876 543 2109|789 12 6118|      876|       2109|     6118|
+-----------+----------------+-----------+---------+-----------+---------+
  • Getting Phone Count

employeesDF.show(truncate=False)
+-----------+---------------------------------+-----------+----------------+
|employee_id|phone_numbers                    |ssn        |phone_number    |
+-----------+---------------------------------+-----------+----------------+
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 123 456 7890 |
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 234 567 8901 |
|2          |+91 234 567 8901                 |456 78 9123|+91 234 567 8901|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 111 111 1111|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 222 222 2222|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 987 654 3210|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 876 543 2109|
+-----------+---------------------------------+-----------+----------------+
employeesDF. \
    groupBy('employee_id'). \
    count(). \
    show()
+-----------+-----+
|employee_id|count|
+-----------+-----+
|          1|    2|
|          3|    2|
|          4|    2|
|          2|    1|
+-----------+-----+