Why pandas feels clunky (to you) when (you're) coming from R

This post is a response to the following blog post: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/

As an experienced R user, the author finds pandas difficult and clunky. Their post provides a detailed comparison of performing a simple data analysis task in both R and Python. In R, the author uses the tidyverse to analyze a dataset of purchases, using operations such as grouping, summarising, and filtering. The R code is concise and intuitive.

The author then describes their approach to the same problem using Python's pandas library, and their version is indeed clunky.

I propose a different perspective: the perceived clunkiness of pandas is not inherent to the library itself, but rather a consequence of approaching Python with an R-centric mindset. This becomes especially clear when you consider people who learn Python and pandas as their first data manipulation tools: they rarely report the same clunkiness. I, for one, found dplyr very clunky at first, because I wanted to write Pythonic code when I first learned R. It got easier once I learned more about R idioms and general coding practices.

The core of the issue lies in the difference in methodologies between R and Python. In the blog post, the author relies heavily on achieving the whole task in a single expression, a common practice in R facilitated by the pipe operator, which allows smooth chaining of functions by feeding the output of one directly into the next. A similar concept exists in Python, particularly when working with pandas, but it is often underutilized or misunderstood: method chaining.
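To make the parallel concrete, here is a minimal sketch of method chaining on a toy table (the data and column names are illustrative, not taken from the original post). Because most pandas methods return a new object, calls can be strung together left to right, much like an R pipe:

```python
import pandas as pd

# Toy purchases table, for illustration only
df = pd.DataFrame({
    'country': ['USA', 'USA', 'Sweden'],
    'amount': [100.0, 200.0, 50.0],
})

# Each method returns a new object, so the next call chains directly onto it
result = (
    df
    .query('amount > 60')          # keep purchases above 60
    .groupby('country')['amount']  # group the remaining rows by country
    .sum()                         # total amount per country
)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps long chains readable without backslash continuations.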

Method chaining is a powerful tool in pandas, but it requires a basic understanding of how each method affects the type and structure of its output. The author's Python solution appears clunky and overly complex because of a lack of familiarity with pandas' capabilities, such as the DataFrame's built-in .pipe method and the ability to fluidly move between Series and DataFrame.
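These two building blocks are worth seeing in isolation before the full example. The sketch below uses a made-up helper (`double_amount`) purely for illustration; the pandas calls themselves (`Series.to_frame`, `DataFrame.pipe`, `DataFrame.assign`) are real:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='amount')

# Series.to_frame() turns a Series back into a one-column DataFrame,
# so a chain that produced a Series can continue with DataFrame methods
frame = s.to_frame()

def double_amount(df):
    # Hypothetical helper, just to show .pipe in action
    return df.assign(amount=df['amount'] * 2)

# DataFrame.pipe(func) calls func(frame), letting arbitrary functions
# participate in a chain without breaking it up
out = frame.pipe(double_amount)
```

`.pipe` is pandas' closest analogue to R's pipe operator: it keeps custom logic inline instead of forcing you to assign an intermediate variable and start a new statement.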

To illustrate a more Pythonic approach, let's consider a refined version of their example. By using the .pipe method to apply custom functions mid-chain, and Series.to_frame() to convert intermediate outputs back into DataFrames, we can achieve the desired result in a more streamlined and elegant manner. Here's an improved version of the Python code:

(
    df
    .groupby('country')['amount']
    .median()                          # median purchase amount per country
    .to_frame(name='med')
    .reset_index()
    .merge(df)                         # attach each country's median to every row
    .pipe(lambda data: data.loc[data['amount'] <= 10 * data['med']])  # drop outliers
    .groupby('country')[['amount', 'discount']].sum()
    .pipe(lambda data: data['amount'] - data['discount'])  # net amount per country
)

This code efficiently computes the desired statistics and returns a Series, which can easily be converted back to a DataFrame with .to_frame() if needed.

That said, I don't recommend method chaining to this extent, because it can be a nightmare to debug. It is often better to take the more Pythonic imperative approach of defining reusable functions and intermediate DataFrames. The folks reviewing your code will thank you.

The key to leveraging pandas effectively is not in attempting to mimic R methodologies, but rather in understanding and adapting to the idiomatic approaches of the language and library being used.