Deep Dive into UDFs with PySpark

Before getting into UDFs, it's essential to know what predefined functions are.

SQL Server, for example, ships with predefined (built-in) functions: if you want the max or min of a column, or to count the records returned by a query, SQL already has functions for that.

Spark SQL also provides predefined functions for working with DataFrames, Datasets, and SQL. Sometimes, though, no built-in function covers a particular use case, and you have to write a UDF to solve the problem.
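
For instance, a minimal sketch of built-in aggregates (the DataFrame and its columns are made up for illustration, and a SparkSession named spark is assumed):

from pyspark.sql import functions as F

# Hypothetical DataFrame, purely to illustrate built-in aggregate functions
df = spark.createDataFrame([(1, 10.0), (2, 25.5), (3, 7.2)], ["id", "amount"])

# max, min, and count are all predefined -- no UDF needed
df.agg(
    F.max("amount").alias("max_amount"),
    F.min("amount").alias("min_amount"),
    F.count("*").alias("row_count"),
).show()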

A UDF, a.k.a. User Defined Function, is a custom function written by the user to solve a problem the built-ins can't. Using UDFs can be advantageous because:

  • They extend the capabilities of Spark SQL with user-defined logic
  • They are straightforward to implement
  • They can transform PySpark DataFrames

Experienced data engineers recommend using UDFs only when no built-in Spark SQL function solves the problem, because UDFs can hurt performance: Spark's optimizer treats them as black boxes, and each call crosses the JVM/Python boundary.
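
As an illustration, here is a sketch of the same transformation written both ways (df and its text column are hypothetical); prefer the built-in form whenever one exists:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UDF version: a black box for the optimizer, with Python serialization overhead
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("text_upper", upper_udf(F.col("text")))

# Built-in version: same result, but Spark can optimize it
df.withColumn("text_upper", F.upper(F.col("text")))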

Steps for using UDFs

1. Create a Spark session and import the required packages

I work with real-time data, so I ingest tweets with Kafka and stream them into my Spark application. Note that you can instead download a dataset from Kaggle or another open data platform and read it with Spark (a batch alternative is sketched after the parsing step below).

from pyspark.sql import SparkSession

# Create the Spark session, pulling in the Kafka connector package
spark = SparkSession \
    .builder \
    .appName("TwitterSentimentAnalysis") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2") \
    .getOrCreate()

# Subscribe to the "twitter" topic on a local Kafka broker
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "twitter") \
    .load()

The DataFrame we receive from Kafka carries several columns, but only one interests us here: the text of the tweet, which arrives serialized as JSON in the value column. We first have to parse it out.

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

# Parse the Kafka "value" payload as JSON and keep only the "text" field
mySchema = StructType([StructField("text", StringType(), True)])
values = df.select(from_json(df.value.cast("string"), mySchema).alias("tweet"))
df1 = values.select("tweet.*")

This gives us df1, a DataFrame with a single text column holding the raw tweets to be transformed.
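
If you'd rather follow along without Kafka, here is a minimal batch sketch (the file name and column name are assumptions) that yields an equivalent df1:

# Hypothetical file downloaded from Kaggle; adjust the path, options, and column name
df1 = spark.read \
    .option("header", "true") \
    .csv("tweets.csv") \
    .select("text")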

2. Create a Python function

The purpose of this function is to clean tweets: it takes a raw tweet as a string parameter, applies a series of transformations, and returns the cleaned string.

import re

def cleanTweet(tweet: str) -> str:
    # remove links
    tweet = re.sub(r'http\S+', '', str(tweet))
    tweet = re.sub(r'bit\.ly/\S+', '', str(tweet))
    tweet = tweet.replace('[link]', '')  # drop the literal "[link]" placeholder

    # remove users (retweet markers and @-mentions)
    tweet = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', '', str(tweet))
    tweet = re.sub(r'(@[A-Za-z]+[A-Za-z0-9-_]+)', '', str(tweet))

    # remove punctuation
    my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@â'
    tweet = re.sub('[' + my_punctuation + ']+', ' ', str(tweet))

    # remove numbers
    tweet = re.sub(r'([0-9]+)', '', str(tweet))

    # remove hashtags
    tweet = re.sub(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', '', str(tweet))

    # remove emoji
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing & misc symbols
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    tweet = emoji_pattern.sub(r'', tweet)

    return tweet
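
Before wrapping it as a UDF, it's worth sanity-checking the function in plain Python (the sample tweet below is made up):

# Made-up tweet, purely for a local sanity check
sample = "RT @someone Check this out https://example.com #Spark 123 🎉"
print(cleanTweet(sample))  # roughly "Check this out" once the URL, mention, hashtag, digits, and emoji are stripped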

3. Convert the Python function to a UDF

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

clean_tweets = F.udf(cleanTweet, StringType())

F.udf takes two arguments: the Python function and its return type.
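
Alternatively, F.udf also works as a decorator, so the conversion can happen right where the function is defined (clean_tweets_udf is just an illustrative name):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Equivalent decorator form: the function is wrapped as a UDF at definition time
@F.udf(returnType=StringType())
def clean_tweets_udf(tweet: str) -> str:
    return cleanTweet(tweet)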

4. Apply the UDF

from pyspark.sql.functions import col

raw_tweets = df1.withColumn('processed_text', clean_tweets(col("text")))

This creates a new column, processed_text, by applying the UDF to the text column of the DataFrame.
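
Since df1 comes from readStream, nothing actually runs until we start a query; here is a minimal sketch using the console sink (for debugging only) to inspect the result:

# Console sink prints each micro-batch so we can check the UDF output
query = raw_tweets.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()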

Well done, you have created a UDF and applied it to a DataFrame. That was pretty simple.

Conclusion

To conclude, UDFs are a handy tool, and they are straightforward to use: you define one once and reuse it across multiple DataFrames, and you can even register it for SQL, as sketched below.
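
For example, registering the function once with spark.udf.register (the name clean_tweets_sql is arbitrary) also makes it callable from Spark SQL:

from pyspark.sql.types import StringType

# Register the same Python function once; any SQL query in this session can use it
spark.udf.register("clean_tweets_sql", cleanTweet, StringType())

df1.createOrReplaceTempView("tweets")
spark.sql("SELECT clean_tweets_sql(text) AS processed_text FROM tweets")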


I hope this information was helpful and exciting. If you have any questions or want to say hi, I'm happy to connect and respond to your questions about my blogs! Feel free to visit my website for more!