Supplement: Pandas Operations For Transforming Data
The Pandas library includes some advanced features that are quite useful for data processing and analysis. In this supplement, we provide an overview of some of the most useful (and at times, confusing) features.
Overview
We’ll look at four functions in roughly increasing order of complexity: map, applymap, apply,
and transform. At a high level, these functions all work similarly: they accept
an input function and use it to modify elements in a dataframe. However, each is used
in slightly different situations. Here is a summary:
map - applies a function element-wise to a Series. This is useful for very simple transformations, but keep in mind that it can only be called on a Series object (e.g., a single column).
applymap - similar to map in that it applies a function element-wise, but it can only be used with a DataFrame. Again, useful for simple transformations.
apply - applies a function along an axis (either rows or columns) in a way that could change the shape of the dataframe.
transform - similar to apply in that it applies a function along an axis, but the result must have the same shape as the input.
One thing to keep in mind with all of the above operations is that they take an input
function to apply. This function is often specified as an anonymous function, i.e.,
using the lambda keyword. For example, one could express the doubling function
using lambda syntax as follows:
lambda x: x * 2
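A lambda is simply a function without a name, so a function defined with def can be used in exactly the same places. The following is a minimal sketch in plain Python; the names double_lambda and double_named are hypothetical and chosen only for illustration:

```python
# Anonymous version, bound to a name only so we can call it below.
double_lambda = lambda x: x * 2


def double_named(x):
    """Named equivalent of the lambda above."""
    return x * 2


values = [1, 2, 3]

# Both functions produce the same doubled values.
print([double_lambda(v) for v in values])  # [2, 4, 6]
print([double_named(v) for v in values])   # [2, 4, 6]
```

Either form can be passed to the operations described in this supplement; lambdas are just more compact for one-line functions.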
Let’s look at each operation in detail with some examples.
Map
Possibly the simplest function, map applies a function element-wise, i.e., to every element
in a Series. Thus, we cannot apply map to an entire dataframe, but we can apply it
to a column. For example, given the following dataframe:
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30]
})
We could apply map to the A column, passing the doubling function as follows:
df["A"].map(lambda x: x * 2)
This would multiply every element of the Series by 2. Note that this doesn’t actually
modify the original df; it returns a new Series object. So if we want to update the
"A" column, we need to save it back to the dataframe, like so:
>>> df["A"] = df["A"].map(lambda x: x * 2)
>>> df
   A   B
0  2  10
1  4  20
2  6  30
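As a further illustration, Series.map also accepts a dictionary, in which case each element is looked up as a key. The sketch below assumes the same small dataframe; the label values and the variable names doubled and labels are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
})

# map with a function: returns a new Series and leaves df untouched.
doubled = df["A"].map(lambda x: x * 2)

# map with a dictionary: each element is replaced by its dictionary value.
labels = df["A"].map({1: "low", 2: "mid", 3: "high"})

print(doubled.tolist())   # [2, 4, 6]
print(labels.tolist())    # ['low', 'mid', 'high']
print(df["A"].tolist())   # [1, 2, 3] (original column is unchanged)
```

Because nothing is modified in place, the results only become part of the dataframe if they are assigned back to a column.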
Applymap
The applymap function works the same as map but for an entire dataframe. Therefore,
given the dataframe from before:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30]
})
We can apply the doubling function to every element in the dataframe as follows:
>>> df.applymap(lambda x: x * 2)
But as before, this does not modify the df object; it returns a new dataframe instead.
So, if we want to modify the df, we need to save the result back to it, like so:
>>> df = df.applymap(lambda x: x * 2)
>>> df
   A   B
0  2  20
1  4  40
2  6  60
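One practical caveat: starting with pandas 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map, which performs the same element-wise operation. The sketch below shows one possible way to write code that works on both older and newer versions; the helper name elementwise is hypothetical and not part of pandas:

```python
import pandas as pd


def elementwise(frame, func):
    """Apply func to every cell of frame (hypothetical helper).

    Uses DataFrame.map when it exists (pandas 2.1 and later) and
    falls back to DataFrame.applymap on older versions.
    """
    if hasattr(pd.DataFrame, "map"):
        return frame.map(func)
    return frame.applymap(func)


df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
})

doubled = elementwise(df, lambda x: x * 2)
print(doubled)
```

If only recent pandas versions need to be supported, calling df.map(...) directly is the simpler choice.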
Apply
The apply function is used to apply a function along a specific axis, either columns
(axis=0) or rows (axis=1). In this way, it works similarly to transform,
which we will talk about last, but the key thing to keep in mind is that apply may
change the shape of the dataframe! Let’s see some examples.
At first, this may not seem like a big deal, since if we have numeric data and we apply
our doubling function, the result of apply is the same as that of applymap, whether
we use rows or columns:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [5, 15, 20],
    "D": [100, 150, 170]
})
>>> df.apply(lambda x: x * 2, axis=0)
   A   B   C    D
0  2  20  10  200
1  4  40  30  300
2  6  60  40  340
# gives the same result!
>>> df.apply(lambda x: x * 2, axis=1)
   A   B   C    D
0  2  20  10  200
1  4  40  30  300
2  6  60  40  340
However, let’s see what happens when we pass the sum function:
>>> df.apply(sum, axis=0)
A      6
B     60
C     40
D    420
dtype: int64
The shape has changed entirely: apply collapsed each column into a single value (its sum), so the result is a Series with one entry per column rather than a dataframe. And of course, if we change the axis, we get a very different result:
>>> df.apply(sum, axis=1)
0    116
1    187
2    223
dtype: int64
In this case, it summed across each row, as expected. Keep in mind that none of these calls changed
the actual contents of the df object.
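To make the shape behavior concrete, the following sketch compares the shapes that apply produces for three different functions on the same dataframe. The variable names same_shape, per_column, and per_row are hypothetical; the point is that the output shape depends on what the passed function returns:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [5, 15, 20],
    "D": [100, 150, 170],
})

# The function returns a Series for each column, so the shape is preserved.
same_shape = df.apply(lambda col: col - col.min(), axis=0)

# The function returns one scalar per column, so we get one entry per column.
per_column = df.apply(lambda col: col.max() - col.min(), axis=0)

# The function returns one scalar per row, so we get one entry per row.
per_row = df.apply(lambda row: row.max() - row.min(), axis=1)

print(same_shape.shape, per_column.shape, per_row.shape)  # (3, 4) (4,) (3,)
```

Checking .shape on the result is a quick way to confirm whether a given function reduced the data or kept it intact.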
Transform
Finally, let’s look at transform, which applies a function to either the columns (axis=0)
or the rows (axis=1), just like with apply, but this time, it must preserve the
original shape of the dataframe. You almost always use transform in conjunction with a
groupby. A standard use of transform is to fill in missing values.
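The missing-value use case can be sketched as follows. This is only an illustration under assumed data: the listings dataframe, its values, and the price_filled column are hypothetical. Each missing price is replaced by the mean price of its own brand, which works because transform returns a result aligned row-for-row with the original data:

```python
import pandas as pd

# Hypothetical data with two missing prices.
listings = pd.DataFrame({
    "brand": ["Toyota", "Toyota", "Toyota", "Tesla", "Tesla"],
    "price": [20000.0, float("nan"), 30000.0, 80000.0, float("nan")],
})

# One mean per brand, repeated for every row of that brand.
# Missing values are skipped when the mean is computed.
brand_mean = listings.groupby("brand")["price"].transform("mean")

# Fill each gap with the mean of the corresponding brand.
listings["price_filled"] = listings["price"].fillna(brand_mean)

print(listings)
```

Because brand_mean has the same index as listings, fillna can pair each missing value with the right group mean without any explicit join.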
Note that transform gives the exact same result as apply and applymap when
passed the doubling function:
>>> df.transform(lambda x: x * 2, axis=0)
   A   B   C    D
0  2  20  10  200
1  4  40  30  300
2  6  60  40  340
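The shape requirement can be observed directly. In the sketch below, a shape-preserving function is accepted, while a function that reduces each column to a single value is rejected. We assume here that pandas signals this with a ValueError, which matches the versions we are aware of; the names centered and rejected are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
})

# Shape-preserving function: every column keeps its three rows.
centered = df.transform(lambda col: col - col.mean())
print(centered.shape)  # (3, 2)

# Reducing function: each column would collapse to one number,
# which transform is expected to refuse.
try:
    df.transform(lambda col: col.sum())
    rejected = False
except ValueError as err:
    rejected = True
    print("transform refused the function:", err)
```

The same reducing function would be accepted by apply, which is the practical difference between the two.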
But let’s look at a slightly more complicated dataframe; suppose we have:
cars = pd.DataFrame({
    "brand": ["Toyota", "Toyota", "Tesla", "Tesla"],
    "price": [20000, 25000, 80000, 90000]
})
If we use groupby to collect elements by brand and then access the price column,
we can then use transform to apply a function like sum, and the result is a
new Series of the same length:
>>> cars.groupby("brand")["price"].transform(sum)
0     45000
1     45000
2    170000
3    170000
Name: price, dtype: int64
The sum for each brand is repeated for every row belonging to that brand. Note that this
behavior differs from that of apply, which returns a single value per group:
>>> cars.groupby("brand")["price"].apply(sum)
brand
Tesla     170000
Toyota     45000
Name: price, dtype: int64
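One consequence of this difference is that a transform result can be assigned straight back as a new column, since it lines up with the original rows. The sketch below uses that to compute each car's share of its brand total; the column names brand_total and share_of_brand are hypothetical. It passes the string "sum" rather than the built-in sum, since recent pandas versions may emit a warning when a built-in is passed to a groupby operation:

```python
import pandas as pd

cars = pd.DataFrame({
    "brand": ["Toyota", "Toyota", "Tesla", "Tesla"],
    "price": [20000, 25000, 80000, 90000],
})

# Same length as cars, so it can be stored as a column directly.
cars["brand_total"] = cars.groupby("brand")["price"].transform("sum")
cars["share_of_brand"] = cars["price"] / cars["brand_total"]

# The reduced form has one row per brand and cannot be assigned row-wise.
totals = cars.groupby("brand")["price"].sum()

print(cars)
print(totals)
```

As a rule of thumb, the reduced form suits summary tables, while transform suits per-row features derived from group-level statistics.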