Extracting Strings using split¶
Let us understand how to extract substrings from main string using split function.
- If we are processing variable length columns with delimiter then we use - splitto extract the information.
- Here are some of the examples for variable length columns and the use cases for which we typically extract information. 
- Address where we store House Number, Street Name, City, State and Zip Code comma separated. We might want to extract City and State for demographics reports. 
- splittakes 2 arguments, column and delimiter.
- splitconvert each string into array and we can access the elements using index.
- We can also use - explodein conjunction with- splitto explode the list or array into records in Data Frame. It can be used in cases such as word count, phone count etc.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
l = [('X', )]
df = spark.createDataFrame(l, "dummy STRING")
from pyspark.sql.functions import split, explode, lit
df.select(split(lit("Hello World, how are you"), " ")). \
    show(truncate=False)
+----------------------------------+
|split(Hello World, how are you,  )|
+----------------------------------+
|[Hello, World,, how, are, you]    |
+----------------------------------+
df.select(split(lit("Hello World, how are you"), " ")[2]). \
    show(truncate=False)
+-------------------------------------+
|split(Hello World, how are you,  )[2]|
+-------------------------------------+
|how                                  |
+-------------------------------------+
df.select(explode(split(lit("Hello World, how are you"), " ")).alias('word')). \
    show(truncate=False)
+------+
|word  |
+------+
|Hello |
|World,|
|how   |
|are   |
|you   |
+------+
- Most of the problems can be solved either by using - substringor- split.
Tasks - split¶
Let us perform few tasks to extract information from fixed length strings as well as delimited variable length strings.
- Create a list for employees with name, ssn and phone_numbers. 
- SSN Format 3 2 4 - Fixed Length with 11 characters 
- Phone Number Format - Country Code is variable and remaining phone number have 10 digits. One can have multiple phone numbers where they are separated by - ,:
- Country Code - one to 3 digits 
- Area Code - 3 digits 
- Phone Number Prefix - 3 digits 
- Phone Number Remaining - 4 digits 
- All the 4 parts are separated by spaces 
- Create a Dataframe with column names name, ssn and phone_number 
- Extract area code and last 4 digits from the phone number. 
- Extract last 4 digits from SSN. 
employees = [(1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890,+1 234 567 8901", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111,+44 222 222 2222", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210,+61 876 543 2109", "789 12 6118"
                     )
                ]
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING,
                    phone_numbers STRING, ssn STRING"""
                   )
employeesDF. \
    select('employee_id', 'phone_numbers'). \
    show(truncate=False)
+-----------+---------------------------------+
|employee_id|phone_numbers                    |
+-----------+---------------------------------+
|1          |+1 123 456 7890,+1 234 567 8901  |
|2          |+91 234 567 8901                 |
|3          |+44 111 111 1111,+44 222 222 2222|
|4          |+61 987 654 3210,+61 876 543 2109|
+-----------+---------------------------------+
from pyspark.sql.functions import split, explode
employeesDF = employeesDF. \
    select('employee_id', 'phone_numbers', 'ssn'). \
    withColumn('phone_number', explode(split('phone_numbers', ',')))
employeesDF.show(truncate=False)
+-----------+---------------------------------+-----------+----------------+
|employee_id|phone_numbers                    |ssn        |phone_number    |
+-----------+---------------------------------+-----------+----------------+
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 123 456 7890 |
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 234 567 8901 |
|2          |+91 234 567 8901                 |456 78 9123|+91 234 567 8901|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 111 111 1111|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 222 222 2222|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 987 654 3210|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 876 543 2109|
+-----------+---------------------------------+-----------+----------------+
employeesDF. \
    select("employee_id", "phone_number", "ssn"). \
    withColumn("area_code", split("phone_number", " ")[1].cast("int")). \
    withColumn("phone_last4", split("phone_number", " ")[3].cast("int")). \
    withColumn("ssn_last4", split("ssn", " ")[2].cast("int")). \
    show()
+-----------+----------------+-----------+---------+-----------+---------+
|employee_id|    phone_number|        ssn|area_code|phone_last4|ssn_last4|
+-----------+----------------+-----------+---------+-----------+---------+
|          1| +1 123 456 7890|123 45 6789|      123|       7890|     6789|
|          1| +1 234 567 8901|123 45 6789|      234|       8901|     6789|
|          2|+91 234 567 8901|456 78 9123|      234|       8901|     9123|
|          3|+44 111 111 1111|222 33 4444|      111|       1111|     4444|
|          3|+44 222 222 2222|222 33 4444|      222|       2222|     4444|
|          4|+61 987 654 3210|789 12 6118|      987|       3210|     6118|
|          4|+61 876 543 2109|789 12 6118|      876|       2109|     6118|
+-----------+----------------+-----------+---------+-----------+---------+
- Getting Phone Count 
employeesDF.show(truncate=False)
+-----------+---------------------------------+-----------+----------------+
|employee_id|phone_numbers                    |ssn        |phone_number    |
+-----------+---------------------------------+-----------+----------------+
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 123 456 7890 |
|1          |+1 123 456 7890,+1 234 567 8901  |123 45 6789|+1 234 567 8901 |
|2          |+91 234 567 8901                 |456 78 9123|+91 234 567 8901|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 111 111 1111|
|3          |+44 111 111 1111,+44 222 222 2222|222 33 4444|+44 222 222 2222|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 987 654 3210|
|4          |+61 987 654 3210,+61 876 543 2109|789 12 6118|+61 876 543 2109|
+-----------+---------------------------------+-----------+----------------+
employeesDF. \
    groupBy('employee_id'). \
    count(). \
    show()
+-----------+-----+
|employee_id|count|
+-----------+-----+
|          1|    2|
|          3|    2|
|          4|    2|
|          2|    1|
+-----------+-----+