Supplement: Pandas Operations For Transforming Data

The Pandas library includes some advanced features that are quite useful for data processing and analysis. In this supplement, we provide an overview of some of the most useful (and at times, confusing) features.

Overview

We’ll look at four functions in roughly increasing order of complexity: map, applymap, apply, and transform. At a high level, these functions all work similarly: they accept an input function and use it to modify elements in a dataframe. However, each is used in slightly different situations. Here is a summary:

  • map – used to apply a function element-wise to a Series. This is useful for very simple transformations, but keep in mind it can only be applied to a Series object (e.g., a column).

  • applymap – similar to map in that it applies a function element-wise, but it can only be used with a DataFrame. Again, useful for simple transformations.

  • apply – used to apply a function along an axis (either rows or columns) in a way that could change the shape of the dataframe.

  • transform – similar to apply in that it applies a function along an axis, but the result must have the same shape as the input. It is most often used together with groupby.

One thing to keep in mind with all of the above operations is that they take an input function to apply. This function is often specified as an anonymous function, i.e., using the lambda keyword. For example, one could express the doubling function using lambda syntax as follows:

lambda x: x * 2
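As a small sketch of what the lambda stands for, the snippet below compares it with an ordinary named function. The names double and double_lambda are made up for this illustration; they do not come from Pandas.

```python
# A named function and a lambda that both double their input.
def double(x):
    return x * 2

# Binding a lambda to a name is done here only so the two can be compared.
double_lambda = lambda x: x * 2

values = [1, 2, 3]
named_result = [double(v) for v in values]
lambda_result = [double_lambda(v) for v in values]

# Both spellings compute the same thing.
print(named_result == lambda_result)
```

Either form can be passed wherever the operations in this supplement expect a function; the lambda is simply more compact for one-line expressions.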

Let’s look at each operation in detail with some examples.

Map

Possibly the simplest function, map applies a function element-wise, i.e., to every element in a Series. Thus, we cannot apply map to an entire dataframe, but we can apply it to a column. For example, given the following dataframe:

df = pd.DataFrame({
  "A": [1, 2, 3],
  "B": [10, 20, 30]
})

We could apply map to the A column, passing the doubling function as follows:

df["A"].map(lambda x: x * 2)

This would multiply every element of the Series by 2. Note that this doesn’t actually modify the original df; it returns a new Series object. So if we want to update the "A" column, we need to save the result back to the dataframe, like so:

>>> df["A"] = df["A"].map(lambda x: x * 2)
>>> df
      A       B
0     2       10
1     4       20
2     6       30
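For readers who want to run the map example end to end, here is a self-contained sketch that also checks the original column before it is overwritten. The variable names doubled and original_a are assumptions introduced for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# map returns a new Series; df itself is untouched at this point.
doubled = df["A"].map(lambda x: x * 2)
original_a = df["A"].tolist()

print(doubled.tolist())
print(original_a)

# Assign the result back to update the column.
df["A"] = doubled
print(df)
```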

Applymap

The applymap function works the same as map but for an entire dataframe. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which behaves the same way.) Given the dataframe from before:

df = pd.DataFrame({
  "A": [1, 2, 3],
  "B": [10, 20, 30]
})

We can apply the doubling function to every element in the dataframe as follows:

>>> df.applymap(lambda x: x * 2)

But as before, this does not modify the df object; it returns a new dataframe instead. So, if we want to modify the df, we need to save the result back to it, like so:

>>> df = df.applymap(lambda x: x * 2)
>>> df
      A       B
0     2       20
1     4       40
2     6       60
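Because applymap was renamed to DataFrame.map in pandas 2.1, the following sketch picks whichever method the installed version provides. The helper name elementwise is an assumption made for this example, and the version check is just one possible way to stay compatible.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Use DataFrame.map when the installed pandas has it (2.1+),
# otherwise fall back to the older applymap.
if hasattr(pd.DataFrame, "map"):
    elementwise = df.map
else:
    elementwise = df.applymap

# The function is applied to every element; df itself is not modified.
doubled = elementwise(lambda x: x * 2)
print(doubled)
```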

Apply

The apply function is used to apply a function along a specific axis: to each column (axis=0) or to each row (axis=1). In this way, it works similarly to transform, which we will talk about last, but the key thing to keep in mind is that apply may change the shape of the dataframe! Let’s see some examples.

At first, this may not seem like a big deal, since if we have numeric data and we apply our doubling function, the result of apply is the same as that of applymap, whether we use rows or columns:

df = pd.DataFrame({
  "A": [1, 2, 3],
  "B": [10, 20, 30],
  "C": [5, 15, 20],
  "D": [100, 150, 170]
})
>>> df.apply(lambda x: x * 2, axis=0)
    A       B       C       D
0   2       20      10      200
1   4       40      30      300
2   6       60      40      340
# gives the same result!
>>> df.apply(lambda x: x * 2, axis=1)
    A       B       C       D
0   2       20      10      200
1   4       40      30      300
2   6       60      40      340

However, let’s see what happens when we pass the sum function:

>>> df.apply(sum, axis=0)
A      6
B     60
C     40
D    420
dtype: int64

The shape of the result is entirely different: each column has been collapsed into a single value (its sum), so we get back a Series with one entry per column instead of a dataframe. And of course, if we change the axis, we get a very different result:

>>> df.apply(sum, axis=1)
0    116
1    187
2    223
dtype: int64

In this case, it summed the rows, as expected. Keep in mind that none of these changed the actual contents of the df object.
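The shape change can be checked directly by comparing the shape attribute of each result. This is a minimal sketch; it calls the .sum() method inside a lambda rather than passing the built-in sum, which gives the same totals for this data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [5, 15, 20],
    "D": [100, 150, 170],
})

# Element-wise work on each column keeps the dataframe's shape.
doubled = df.apply(lambda col: col * 2, axis=0)

# A reducing function collapses each column (axis=0) or each row (axis=1).
col_sums = df.apply(lambda col: col.sum(), axis=0)
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(df.shape, doubled.shape, col_sums.shape, row_sums.shape)
```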

Transform

Finally, let’s look at transform, which applies a function to either the columns (axis=0) or the rows (axis=1), just like with apply, but this time, it must preserve the original shape of the dataframe. You almost always use transform in conjunction with a groupby. A standard use of transform is to fill in missing values.

Note that transform gives the exact same result as apply and applymap when passed the doubling function:

>>> df.transform(lambda x: x * 2, axis=0)
    A       B       C       D
0   2       20      10      200
1   4       40      30      300
2   6       60      40      340
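By contrast, a function that reduces each column to a single value is rejected by transform, because the result would no longer line up with the original rows. The sketch below assumes a recent pandas version; the exact error message may differ between versions, so only the exception type is checked. The flag reducer_rejected is a name invented for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# A shape-preserving function is accepted.
same_shape = df.transform(lambda col: col * 2)

# A reducing function is refused with a ValueError.
try:
    df.transform(lambda col: col.sum())
    reducer_rejected = False
except ValueError as err:
    reducer_rejected = True
    print("transform refused the reducer:", err)
```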

But let’s look at a slightly more complicated dataframe; suppose we have:

cars = pd.DataFrame({
  "brand": ["Toyota", "Toyota", "Tesla", "Tesla"],
  "price": [20000, 25000, 80000, 90000]
})

If we use groupby to collect elements by brand and then access the price column, we can then use transform to apply a function like sum, and the result is a new Series of the same length:

>>> cars.groupby("brand")["price"].transform(sum)
0     45000
1     45000
2    170000
3    170000
Name: price, dtype: int64

The sum for each brand is repeated on every row that belongs to that brand, so the result lines up with the original rows. Note that this behavior differs from that of apply, which returns one value per group:

>>> cars.groupby("brand")["price"].apply(sum)
brand
Tesla     170000
Toyota     45000
Name: price, dtype: int64
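The missing-value use case mentioned at the start of this section can be sketched as follows. The data here is invented for illustration (a larger cars table with two missing prices), and the column name price_filled and the variable brand_mean are assumptions. Because transform returns a Series aligned with the original rows, it can be passed straight to fillna.

```python
import pandas as pd

# Hypothetical data: two prices are missing (None becomes NaN).
cars = pd.DataFrame({
    "brand": ["Toyota", "Toyota", "Toyota", "Tesla", "Tesla", "Tesla"],
    "price": [20000.0, None, 25000.0, 80000.0, 90000.0, None],
})

# Per-brand mean, repeated on every row of that brand (NaN is skipped).
brand_mean = cars.groupby("brand")["price"].transform("mean")

# Fill each missing price with the mean of its own brand.
cars["price_filled"] = cars["price"].fillna(brand_mean)
print(cars)
```

Using apply here would instead return one mean per brand, which would then have to be merged back onto the rows by hand.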