Pandas is the driving force behind literally millions of notebooks, and even that is an understatement. Prior work gives us an idea of the scale: the authors gathered one million notebooks from GitHub and found that 42.3% of those that imported packages imported pandas, which is to say that almost every other notebook uses pandas (and that work did not consider the notebooks on Kaggle, Google Colab, etc.).

We’re interested in studying and improving the performance of all these pandas-backed notebooks, which we refer to as single-machine notebooks. These differ from cloud/distributed solutions (e.g., Snowflake Pandas, Spark DataFrames, etc.) in that they run on a single machine. This is obvious for all the notebooks pandas users run on their laptops, and interestingly it is also the case for platforms like Kaggle and Google Colab, since the resources they provide are those found in a single machine.

But at the moment there is a big void in the ecosystem: there is no good benchmark for the Pandas API. And what a void that is—because without a benchmark, how can we understand where the available systems stand in terms of capabilities? And if we don’t know that, how can we improve upon them, or simply know which one to use?

To that end, we created PandasBench, the first systematic effort to create a benchmark for the Pandas API for single-machine workloads. PandasBench is the largest Pandas API benchmark to date with 102 notebooks and 3,721 cells. We used it to evaluate Modin, Dask, Koalas, and Dias, over randomly-selected real-world notebooks from Kaggle, creating the largest-scale evaluation of any of these techniques to date. We show that slowdowns over these single-machine notebooks are the norm, and we identify many failures of these systems which we highlight below.

The First Pandas API Benchmark

As we said, we created PandasBench because there was no good benchmark for the Pandas API that could serve as a standard for the performance evaluation of Pandas API optimization techniques. In other words, there was no way to know how good the existing techniques were, and that of course includes the research we’re carrying out ourselves! In fact, none of the workload sets used in prior work to evaluate the respective systems were created systematically to fulfill the requirements of a standard benchmark, which should (at least) include a large set of workloads, cover real-world use cases, and bundle realistic input data (so that we can execute the notebooks). In this list we even include the set of workloads we created to evaluate Dias: dias-benchmarks! To the best of our knowledge, the Dias benchmarks constitute the largest collection of real-world, executable notebooks before PandasBench, but they still do not measure up to PandasBench on several dimensions.

At the same time, existing benchmarks from other domains are insufficient for the Pandas API. More specifically, dataframes (the driving force behind the Pandas API) are closest, as a data model, to the relational model and to matrices. Matrices are clearly much simpler than dataframes, because they only contain numbers, and so any related benchmark falls short (which is why, e.g., numpy benchmarks like these don’t capture the characteristics of dataframes). As far as the relational model goes, the standard benchmarks there are part of the TPC family. In the paper we present a detailed argument for why the TPC variants are inadequate (see the Introduction and Appendix A). The summary is that the TPC variants don’t test core features and use cases of the Pandas API, such as burst computations, reading and writing raw data files, data cleaning, and inspection of intermediate results.
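To make this concrete, here is the shape of a typical notebook fragment that TPC-style benchmarks do not model (the file names are hypothetical; the point is only the pattern of reading raw files, inspecting intermediate results, and cleaning data):

import pandas as pd

df = pd.read_csv("train.csv")              # read a raw data file (hypothetical name)
df.head()                                  # inspect an intermediate result
df = df.fillna(0)                          # data cleaning
df.describe()                              # more inspection
df.to_csv("train_clean.csv", index=False)  # write a raw file back out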

In summary, there was a big void in benchmarks for the Pandas API, and with PandasBench we attempted to ameliorate the situation by focusing on single-machine workloads. Such workloads are ubiquitous, and probably what the “average user” of pandas will concern themselves with (as we said earlier, millions of single-machine notebooks exist on platforms such as GitHub, Kaggle, and Google Colab).

We constructed PandasBench with one primary goal in mind: testing the real-world coverage of the Pandas API. This concerns how the API is used in practice, because, e.g., only 10% of an API may actually be exercised in practice. For example, it’s much less of an issue if a technique does not support an obscure, rarely used operation (e.g., .pop()), compared to not supporting .sort_values(), which is found in approximately 50% of the notebooks. This pragmatism was the central influence on how we collected and prepared PandasBench.
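As a rough illustration of how such usage numbers can be obtained (this is a sketch, not necessarily the methodology used in the paper; the notebook directory is hypothetical), one can simply count the fraction of notebooks whose code cells mention a given method:

import glob
import json

def fraction_using(method, pattern="notebooks/*.ipynb"):  # hypothetical directory
    paths = glob.glob(pattern)
    hits = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            nb = json.load(f)
        # Concatenate all code cells of the notebook into one string.
        code = "\n".join("".join(cell["source"]) for cell in nb["cells"]
                         if cell["cell_type"] == "code")
        hits += method in code
    return hits / len(paths) if paths else 0.0

print(fraction_using(".sort_values("))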

The Construction of PandasBench



Collection
We need to collect notebooks such that we satisfy the real-world coverage conditions we lay down below. As we explain, satisfying all three conditions simultaneously creates a hard problem that practically constrains us to collect notebooks from sources that provide both the code and the data; for that, we use Kaggle.
Fixing
Fixing involves finding the correct library versions, replacing deprecated methods, and undoing mismatches between the code and the data. For example, in the following snippet we replace a deprecated method.
# Original
X = trainData[testFeatures].fillna(0).as_matrix()
# Replacement
X = trainData[testFeatures].fillna(0).to_numpy()
The challenge here is that we do not know of any systematic method to fix all the problems. Essentially, we came across an unbounded number of problems.
Cleaning
In Cleaning we remove code that does not use the Pandas API, including machine-learning code like that in the snippet below. But there is a problem: dependencies. There may be Pandas code that depends on the code we want to delete.
lm=smf.ols('...',data = train).fit()
lm.summary()
# This is Pandas code, but it
# depends on non-Pandas code.
imc = pd.DataFrame(lm.pvalues)
Adaptation
The goal of Adaptation is to make sure the notebooks are appropriate for a benchmark suite. One main concern in that regard is making sure all the original code gets executed. Libraries like Modin and Dask (unlike pandas) only evaluate code when a side-effect-incurring operation forces it (e.g., plotting a Series). But since some of this code is deleted during Cleaning, some Pandas code that used to feed it would remain unevaluated (see the sketch after this overview).
Scaling
Scaling determines how we scale each notebook’s data. As we explain below, we follow an unconventional scheme, in which the data of every notebook is scaled independently. Instead of targeting scaling factors, we target runtimes (e.g., we scale the data of each notebook independently such that each notebook runs for about 5s).
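To illustrate the lazy-evaluation issue mentioned in the Adaptation step above, here is a Dask-flavored sketch (the file and column names are hypothetical, and this is not necessarily the exact adaptation PandasBench performs):

import dask.dataframe as dd

ddf = dd.read_csv("train.csv")        # lazy: nothing is read yet
means = ddf.groupby("label").mean()   # still lazy: only a task graph is built

# Originally, downstream (non-Pandas) code consumed `means` and forced its
# evaluation. Once that code is removed during Cleaning, `means` would never
# be computed, so Adaptation has to force evaluation explicitly, e.g.:
_ = means.compute()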

Collection

Our main target is real-world coverage. In the paper we argue that a benchmark tests real-world coverage if it fulfills the following (real-world coverage) conditions: RWC1) it uses real-world code, RWC2) is executable, and RWC3) is large and diverse.

The interesting thing here is that satisfying either of the combinations RWC1+RWC3 or RWC2+RWC3 is easy; but if we include both RWC1 and RWC2, the problem becomes significantly harder! For example, let’s say we drop RWC1. Then, we can simply create some random data (whose structure we guide and control), and then programmatically generate (copious amounts of) code that aligns with that data. We don’t claim that we have tried this out, but based on our programming-languages background, we deem it tractable. However, adding RWC1 back forbids generating code programmatically, so this whole approach goes out of the window.
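For concreteness, here is a toy sketch of the generate-both-data-and-code route that dropping RWC1 would allow (again, we are not claiming anyone has built this; it only shows why controlling the data makes code generation easy):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 100, 10_000),
                   "b": rng.random(10_000)})

# "Generated" workload: a chain of operations over columns we know exist.
generated = "df[df['a'] > 50].groupby('a')['b'].mean().sort_values()"
print(eval(generated).head())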

One may then think: just pull real-world code, e.g., from GitHub, and generate data for it. However, this (specifically: generating valid data for Pandas API code) is a problem that, to the best of our knowledge, is unsolved (see Section 3 in the paper)! Again, if we instead remove RWC2 (i.e., being able to execute the code), then the problem becomes almost trivial: we simply pull thousands of notebooks from Kaggle. In fact, the problem is already solved by works such as KGTorrent, which contains more than 200,000 notebooks.

In summary, if we want to satisfy all three conditions simultaneously, we need to collect notebooks from a source that gives us access to real-world code, and provides both the code and the data. The good news is that such a source exists: Kaggle! Kaggle hosts notebooks that solve real-world problems (e.g., in competitions that reward actual money), and it gives us both the notebook, and the original (and thus non-synthetic) data it used. The bad news is that Kaggle is less than ideal if we want to end up with executable notebooks.

The problem is that the code and data alone are not enough to execute a notebook, and they’re definitely not enough if we want this notebook to be part of a Pandas API benchmark; at the least we need some information about the environment (which libraries were used, what their versions were, etc.). Unfortunately, Kaggle doesn’t provide that information, and we don’t know of any way to automate its extraction. So, the manual effort involved imposed a limit on how many notebooks we could collect.

Preparation

That said, since the notebooks were collected manually, we had the chance to carry out an intricate and detailed preparation process that ensures each notebook is executable and appropriate for a Pandas API benchmark suite (this process is described in Section 4 of the paper, and in more detail in Appendix B). For example, the first step is to fix each notebook (library versions, missing code, etc.) such that it runs! Another is to clean it. Notebooks may contain code that is irrelevant for a Pandas API benchmark, namely code that does not use the Pandas API (or at least APIs of the same family, like numpy), such as machine-learning code. So, we remove such code to isolate what is relevant.

Probably the most interesting step is the last one, which involves data scaling. Data scaling is a standard feature of most benchmark suites (e.g., TPC-H), but in our case the interesting aspect is that it is not uniform! A uniform data scaling scheme means that we scale the data of every workload by the same amount, e.g., 2×. The simplest case is when all workloads use the same data, like in TPC-H. This is not the case in PandasBench, in which every notebook was collected independently and thus uses its own data.

Upon reflection it became clear that scaling the data uniformly does not make sense. Consider this: why do we scale data in the first place? We downscale data because a workload may be too slow, or because it already has enough data volume for our purposes, and we upscale data for the opposite reasons. Then it’s obvious that there’s no reason to assume that if notebook A is, e.g., too slow and we need to downscale its data, notebook B will also be (equally) too slow and thus need its data downscaled (by the same amount). This is simply because notebooks A and B are completely independent, both in the code they run and in the data (and data sizes) they use.

In short, we need to scale the data independently for each notebook. But we need to decide how that will happen; in other words, we need a scaling scheme! It seems that the most useful scaling scheme involves targeting specific running times. For example, the goal is to scale the data such that every notebook runs for roughly 5 seconds. Again, this may involve scaling down the data for some notebooks, while scaling up the data for others. In this scheme, by targeting different runtimes we can emulate different classes of workloads. So, this is the scheme we proceeded with, which allowed us to distill useful conclusions.
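A minimal sketch of what such a scheme might look like (run_notebook and scale_data are hypothetical callables; this is not the exact procedure PandasBench uses):

import time

def find_scale(run_notebook, scale_data, target_s=5.0, tol=0.2):
    scale = 1.0
    for _ in range(20):                       # cap the search
        scale_data(scale)                     # rewrite this notebook's inputs at `scale`
        start = time.perf_counter()
        run_notebook()
        elapsed = time.perf_counter() - start
        if abs(elapsed - target_s) / target_s <= tol:
            return scale
        scale *= target_s / elapsed           # assume runtime is roughly linear in data size
    return scale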

Our Findings Using PandasBench

We created PandasBench to evaluate Pandas optimization techniques in terms of performance and (real-world) coverage, and this is the use case we’ll explore here. We evaluated three Pandas API alternatives (Modin, Dask, and Koalas) and one Pandas API complement, Dias (we also tried to evaluate SCIRPy, but we were not able to obtain a copy of the code). A Pandas API alternative is an implementation of the Pandas API that is meant to act as a drop-in replacement. A Pandas API complement is a technique that complements an existing Pandas API implementation.
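Concretely, an alternative such as Modin is adopted by swapping the import, whereas a complement such as Dias keeps pandas as the execution engine and rewrites the code that calls it. A minimal illustration of the alternative side:

# import pandas as pd          # original notebook code
import modin.pandas as pd      # Modin: same API, different execution engine

df = pd.DataFrame({"a": [1, 2, 3]})
print(df["a"].sum())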

The evaluation section of our paper (Section 6) is extensive and, we hope, illuminating. Here we’ll just mention some highlights, which boil down to:

  1. No system we tested gives significant speedups for the majority of the notebooks.
  2. Modin, Dask, and Koalas face serious coverage problems, something that is true for Dias too, but to a lesser extent.
  3. The memory consumption of Modin, Dask, and Koalas can make running notebooks impossible on a single machine, especially for Modin.

In that regard, this evaluation reaches conclusions that are consistent with the experimental results of our prior work, Dias: most existing techniques slow down almost all of the notebooks (we remind the reader that we focus on single-machine workloads) and can’t act as replacements due to insufficient coverage. To our dismay, this is true even for Dias, our own tool. Even though it achieved the best speedups and the best coverage, it did not speed up most of the notebooks.

In terms of memory consumption, the situation is even worse (although thankfully in this case Dias is excluded, because it does not have a storage layer and only touches code). Modin, for example, reaches hundreds of gigabytes, and no ordinary computer would be able to run PandasBench with Modin, Dask, or Koalas. This is, in fact, why we used a different machine for the PandasBench evaluation (we remind our readers that for Dias we used a commodity machine, whereas here we used a cluster with 1TB of main memory).

Definitely the most unexpected result concerns the real-world coverage of the techniques (see the Collection section). In short, no technique runs all 102 notebooks, and Dask and Koalas run at most 10. Modin fares a bit better with 72, and Dias even better with 97.


PandasBench helped us identify and classify the problems that led to these failures; these are summarized in the table below. The paper gives specific examples for each of them.


For us, definitely the most surprising result was that scaling impacted how many notebooks ran! None of us anticipated it: why would data scaling have anything to do with what code runs? After observing this result, our first supposition was that out-of-memory failures must have played a role. And sure enough, that was the case for Koalas, which had the most failures due to scaling (only 6 of the 10 notebooks (60%) that ran on the default target runtime also ran on the 20-second target runtime). Koalas uses Spark under the hood, and the failures occur because the JVM workers run out of memory.

But the situation puzzled us for Modin and Dask. Modin fails with scaled data because, on larger target runtimes, it internally does not convert a Series to a numpy.ndarray, and so when a numpy method is called, an exception is raised. Dask acted even more unexpectedly: it runs one more notebook in the default target runtime than in the 5-second target runtime. It turns out this is because that notebook’s data was downscaled in the 5-second target runtime. With the downscaling, a float that appears late in a column that Dask infers as int gets cut out, which “avoids” an inference error (for more information, see Schema Inference in Section 5 and Bad Type Inference in Section 6.2.1).
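For the curious, here is a reconstruction of the kind of Dask failure we mean (a toy file, not the actual notebook; Dask infers column dtypes from a sample at the start of the file, so a float appearing only near the end can contradict the inferred int64 dtype and typically surfaces as a dtype-mismatch error at compute time):

import dask.dataframe as dd

# Hypothetical toy file: a column that looks like int64 for a long prefix,
# with a single float appearing only at the very end.
with open("col.csv", "w") as f:
    f.write("x\n")
    f.writelines("1\n" for _ in range(500_000))
    f.write("2.5\n")

ddf = dd.read_csv("col.csv", sample=1024)  # dtypes inferred from the first 1 KB only
print(ddf["x"].sum().compute())            # typically fails: float64 found where int64 was inferred

# Downscaling the data can drop the late float entirely, so the inferred
# dtype holds and the failure disappears, which is how scaling changed
# which notebooks ran.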

Conclusion

We hope PandasBench will help both researchers and practitioners gain insights into the landscape of the current Pandas API solutions. Our findings may make the current situation appear dreary, but only through an understanding of the ecosystem will we be able to improve it. We do believe that a benchmark is a positive and necessary step towards that goal.

However, we should note that PandasBench is only the first step! It was not created to (and cannot) serve as the be-all and end-all benchmark for the Pandas API. For example, we recognize that there are important classes of workloads and use cases that it does not capture (e.g., cloud workloads). But it is an addition to the presently limited benchmarking apparatus available for the Pandas API, which we hope to extend; and if we don’t, we hope someone else will!