
In Apache Spark, it is quite common that the same transformation can be achieved in different ways. This is partly a consequence of the continuous development of Spark, because new techniques and functions become available in new releases. In this article, we will compare 9 different techniques that lead to the same result, but each of them has a very different performance. We will show it on one particular example related to arrays processing, but the results can be generalized beyond it.

When measuring the performance of different transformations in Spark, one needs to be careful to avoid some pitfalls. First of all, we need to materialize the transformations by calling an action, which runs a Spark job (to understand the difference between transformations and actions in Spark, see my recent article). One action that may come to mind is the function count(). The problem with count, however, is that Spark doesn't always need to execute all transformations to evaluate how many rows the resulting DataFrame has. The Spark optimizer can simplify the query plan in such a way that the actual transformation you want to measure is skipped, because it is simply not needed for finding out the final count.
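As a minimal, hypothetical sketch of this pitfall (the column names and the split() transformation are placeholders, not the article's benchmark query), the derived column below can be pruned by the optimizer when only a row count is requested:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a text column; names and data are illustrative.
df = spark.createDataFrame([("hello world",), ("spark benchmark",)], ["text"])

# The transformation we would like to measure.
transformed = df.withColumn("words", F.split("text", " "))

# Pitfall: to compute the count, Spark does not need the "words" column,
# so the optimizer may prune the split() away and we measure almost nothing.
transformed.count()
```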

So to make sure that all transformations are really executed, it is better to call write and save the result somewhere. But saving the output has another drawback, because the execution time will also be affected by the writing process. In Spark 3.0 the situation simplified and doing performance benchmarks became much more convenient thanks to the noop write format, which is a new feature in that release. We can simply specify it as the write format and it will materialize the query and execute all the transformations, but it will not write the result anywhere.
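As a rough sketch of how the noop format can be used for timing (reusing the transformed DataFrame from the sketch above; the timing with Python's time module is just an illustrative choice, not something prescribed by the article):

```python
import time

start = time.time()

# Materialize all transformations without writing the result anywhere.
transformed.write.format("noop").mode("overwrite").save()

print(f"Elapsed: {time.time() - start:.2f} s")
```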

The use case for which we are going to benchmark the queries is related to the following example. Imagine a dataset where we have an array of words on each row.
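A minimal sketch of what such a dataset could look like (the column name words and the sample rows are illustrative assumptions, not the article's actual data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row carries an array of words; data and column name are illustrative.
df = spark.createDataFrame(
    [
        (["how", "are", "you"],),
        (["tomorrow", "and", "tomorrow", "and", "tomorrow"],),
    ],
    ["words"],
)

df.show(truncate=False)
```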
