Supplement: Pandas Operations For Transforming Data
====================================================

The Pandas library includes some advanced features that are quite useful for data processing and analysis. In this supplement, we provide an overview of some of the most useful (and at times, confusing) features.

Overview
--------

We'll look at four functions in roughly increasing order of complexity: ``map``, ``applymap``, ``apply``, and ``transform``. At a high level, these functions all work similarly: they accept an input function and use it to compute new values from the elements of a Series or dataframe. However, each is used in slightly different situations. Here is a summary:

* ``map`` -- used to apply a function element-wise to a **Series**. This is useful for very simple transformations, but keep in mind that here it is applied to a Series object (e.g., a column).
* ``applymap`` -- similar to ``map`` in that it applies a function element-wise, but it can only be used with a **DataFrame**. Again, useful for simple transformations.
* ``apply`` -- used to apply a function along an axis (either rows or columns) in a way that could change the shape of the dataframe.
* ``transform`` -- similar to ``apply`` in that it applies a function along an axis, but the result must preserve the original shape of the dataframe.

One thing to keep in mind with all of the above operations is that they take an *input function* to apply. This function is often specified as an anonymous function, i.e., using the ``lambda`` keyword syntax. For example, one could express the doubling function using lambda syntax as follows:

.. code-block:: python

    lambda x: x * 2

Let's look at each operation in detail with some examples.

Map
---

Possibly the simplest function, ``map`` applies a function element-wise, i.e., to every element in a Series. Thus, we apply ``Series.map`` to a single column rather than to an entire dataframe. For example, given the following dataframe:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({
        "A": [1, 2, 3],
        "B": [10, 20, 30]
    })

We could apply ``map`` to the ``A`` column, passing the doubling function as follows:

.. code-block:: python

    df["A"].map(lambda x: x * 2)

This would multiply every element of the Series by 2. Note that this doesn't actually modify the original ``df``; it returns a new Series object. So if we want to update the ``"A"`` column, we need to save it back to the dataframe, like so:

.. code-block:: python

    >>> df["A"] = df["A"].map(lambda x: x * 2)
    >>> df
       A   B
    0  2  10
    1  4  20
    2  6  30

Applymap
--------

The ``applymap`` function works the same as ``map`` but for an entire dataframe. (As of pandas 2.1, ``applymap`` is deprecated in favor of the equivalent ``DataFrame.map`` method; the examples here use the original name.) Therefore, given the dataframe from before:

.. code-block:: python

    df = pd.DataFrame({
        "A": [1, 2, 3],
        "B": [10, 20, 30]
    })

We can apply the doubling function to every element in the dataframe as follows:

.. code-block:: python

    >>> df.applymap(lambda x: x * 2)

But as before, this does not modify the ``df`` object; it returns a new dataframe instead. So, if we want to modify ``df``, we need to save the result back to it, like so:

.. code-block:: python

    >>> df = df.applymap(lambda x: x * 2)
    >>> df
       A   B
    0  2  20
    1  4  40
    2  6  60

Apply
-----

The ``apply`` function is used to apply a function along a specific axis: the function receives either each column (``axis=0``) or each row (``axis=1``) as a Series. In this way, it works similarly to ``transform``, which we will talk about last, but the key thing to keep in mind is that ``apply`` may change the shape of the dataframe! Let's see some examples.

At first, this may not seem like a big deal, since if we have numeric data and we apply our doubling function, the result of ``apply`` is the same as that of ``applymap``, whether we use rows or columns:

.. code-block:: python

    df = pd.DataFrame({
        "A": [1, 2, 3],
        "B": [10, 20, 30],
        "C": [5, 15, 20],
        "D": [100, 150, 170]
    })

.. code-block:: python

    >>> df.apply(lambda x: x * 2, axis=0)
       A   B   C    D
    0  2  20  10  200
    1  4  40  30  300
    2  6  60  40  340

.. code-block:: python

    # gives the same result!
    >>> df.apply(lambda x: x * 2, axis=1)
       A   B   C    D
    0  2  20  10  200
    1  4  40  30  300
    2  6  60  40  340

However, let's see what happens when we pass the ``sum`` function:

.. code-block:: python

    >>> df.apply(sum, axis=0)
    A      6
    B     60
    C     40
    D    420
    dtype: int64

The shape of the result is entirely different: instead of a dataframe, we get a Series with one entry per column, because the rows of each column have been collapsed into a single value (the sum). And of course, if we change the axis, we get a very different result:

.. code-block:: python

    >>> df.apply(sum, axis=1)
    0    116
    1    187
    2    223
    dtype: int64

In this case, it summed across each row, as expected. Keep in mind that none of these changed the actual contents of the ``df`` object.

Transform
---------

Finally, let's look at ``transform``, which applies a function to either the columns (``axis=0``) or the rows (``axis=1``), just like ``apply``, but this time the result must preserve the original shape of the dataframe. You almost always use ``transform`` in conjunction with a ``groupby``. A standard use of ``transform`` is to fill in missing values.

Note that ``transform`` gives the exact same result as ``apply`` and ``applymap`` when passed the doubling function:

.. code-block:: python

    >>> df.transform(lambda x: x * 2, axis=0)
       A   B   C    D
    0  2  20  10  200
    1  4  40  30  300
    2  6  60  40  340

But let's look at a slightly more complicated dataframe; suppose we have:

.. code-block:: python

    cars = pd.DataFrame({
        "brand": ["Toyota", "Toyota", "Tesla", "Tesla"],
        "price": [20000, 25000, 80000, 90000]
    })

If we use ``groupby`` to collect elements by brand and then access the ``price`` column, we can then use ``transform`` to apply a function like ``sum``, and the result is a new Series of the same length as the original column:

.. code-block:: python

    >>> cars.groupby("brand")["price"].transform(sum)
    0     45000
    1     45000
    2    170000
    3    170000
    Name: price, dtype: int64

The sum for each brand is repeated for every row belonging to that brand. Note that this behavior differs from that of ``apply``, which returns one value per group:

.. code-block:: python

    >>> cars.groupby("brand")["price"].apply(sum)
    brand
    Tesla     170000
    Toyota     45000
    Name: price, dtype: int64
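
The ``transform`` discussion above mentions that a standard use is filling in missing values, so here is a minimal sketch of that pattern. The ``cars_missing`` dataframe, its values, and the ``price_filled`` column are hypothetical and chosen only for illustration; the idea is to replace each missing price with the mean price of the same brand. Because ``transform`` returns one value per original row, the group means line up with the original column and can be passed directly to ``fillna``:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical data: the same brands as the cars example, with two prices missing.
    cars_missing = pd.DataFrame({
        "brand": ["Toyota", "Toyota", "Toyota", "Tesla", "Tesla", "Tesla"],
        "price": [20000, np.nan, 25000, 80000, 90000, np.nan]
    })

    # One mean per original row (missing values are skipped when averaging),
    # so this Series has the same length and index as cars_missing["price"].
    brand_mean = cars_missing.groupby("brand")["price"].transform("mean")

    # Fill each missing price with the mean of its own brand.
    cars_missing["price_filled"] = cars_missing["price"].fillna(brand_mean)

With these assumed values, the missing Toyota price would be filled with 22500 and the missing Tesla price with 85000. Using ``apply`` in place of ``transform`` here would not work as directly, since it returns one value per brand rather than one value per row.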