Introduction to Pandas 
======================

In this module, we will introduce the Python library ``pandas`` for working with the datsets 
and manipulating them [1]. After going through this module, students should be able to:

* Install and import the ``pandas`` package into a Python program.
* Understand the primary differences between the pandas ``Series`` and ``Dataframe``, and when to use each.
* Loading data from an external file into a pandas object. 
* Accessing pandas ``Series`` and ``Dataframe`` to perform various dataset manipulations.


Installing Pandas
~~~~~~~~~~~~~~~~~
The ``pandas`` package is available from the Python Package Index (PyPI) and can be installed on most
platforms using a Python package mananger such as ``pip``:

.. code-block:: console

  [container/virtualenv]$ pip install pandas

Once installed, we can import the ``pandas`` package; it is customary to import the top level package 
as ``pd``, i.e., 

.. code-block:: python3
    
    >>> import pandas as pd

.. note::

    You may see a warning message stating "Pyarrow will become a required dependency of pandas 
    in the next major release of pandas (pandas 3.0)...".

    This can be safely ignored. 


Now, let's take a look at the basic data structures supported by Pandas.


Pandas Series
~~~~~~~~~~~~~

    A pandas Series is a one-dimensional array that can hold data of different types (string, float, integer, objects, etc.) as well as axis labels. It can be thought of 
    as a single column in a dataset.

    We can create ``Series`` objects in different ways; for example, directly from a 
    numpy array or python list: 

    .. code-block:: python3
        
        >>> a = [1, 5, 8]
        >>> m = pd.Series(a)
        >>> m
        0    1
        1    5
        2    8
        dtype: int64

    As you can see from the output, every value in a pandas Series is labeled. If nothing 
    is specified when constructing the Series, values are labeled staring from index 0 
    (i.e., the first value will have index 0, the second will have index as 1, and so on).

    For instance, with the previous example, we can put:

    .. code-block:: python3 

        >>> m[2]
        8

    This works exactly like Python lists and numpy arrays. 

    However, we can customize the label indexes using the ``index`` argument 
    while creating series.

    .. code-block:: python3

        >>> a = [1, 5, 8]
        >>> m = pd.Series(a, index=["X", "Y","Z"])
        >>> m
        X    1
        Y    5
        Z    8
        dtype: int64

    And now we can use these custom lables to index the elements of the series; for example: 

    .. code-block:: python3

        >>> m["Y"]
        5

    Note that if we specify custom index lables, we shouldn't use the 0-based integer indexing 
    to index into our series.

    What happens if you try the following: 

    .. code-block:: python3

        >>> m[1]
        ?

    Custom labels for indexes provide part of the power of pandas; we can use lables 
    to attach meaning (or "metadata") to our data columns. 

    For example, say we want to create a series of back to school supplies with their cost, 
    and we have a supplies list and a cost list as follows:

    .. code-block:: python3 

        >>> supplies = ['Spiral_Notebook', 'Gel_Pens', 'Sticky_Notes', 'Laptop_Bag', 'Daily_Planner']
        >>> cost_supplies_dollars = [2.81, 9.99, 3.99, 23.66, 10.99]

    We can use these to create a Series as follows: 

    .. code-block:: python3 

        >>> supplies_cost = pd.Series(cost_supplies_dollars, index=supplies)
        >>> supplies_cost
        Spiral_Notebook     2.81
        Gel_Pens            9.99
        Sticky_Notes        3.99
        Laptop_Bag         23.66
        Daily_Planner      10.99
        dtype: float64

    We see that our series is indexed by the labels we gave for the prices. We can 
    now access the prices using the meaningful labels, e.g., 

    .. code-block:: python3 

        >>> supplies_cost['Gel_Pens']
        9.99

    We can even use these custom index labels in slices, but note that the slice is 
    inclusive of both endpoints; for instance, 

    .. code-block:: python3 

        >>> supplies_cost["Gel_Pens":"Daily_Planner"]
        Gel_Pens          9.99
        Sticky_Notes      3.99
        Laptop_Bag       23.66
        Daily_Planner    10.99
        dtype: float64

    **In-class Exercise:** 

    1. Try accessing multiple elements of the supplies_cost series at positions 0, 1, and 3.

    2. What will be the output of following code?

    .. code-block:: python3

        >>> supplies_cost[:'Laptop_Bag']


Pandas DataFrame
~~~~~~~~~~~~~~~~

The ``DataFrame`` is perhaps the most important and useful data structure in pandas. A pandas 
DataFrame is similar to a 2d-array that can hold heterogeneous data and labeled axes. You can 
think of a DataFrame as representing a spreadsheet or a database table with multiple columns. 
Said differently, a DataFrame object is like a dictionary of Series objects. 

Let's look at some examples to make it more clear. 

To begin, suppose we had information on employees at UT Austin. If we were storing this information 
in a spreadsheet, we might have several columns, such as: 

* Name
* EID
* Location 
* Department 

Each employee could be thought of as a row in our spreadsheet with values for each of the columns above. 
For instance, we might have data on the following employees: 

* John Doe, E0124, Austin, ITS
* Luna Lau, E0125, Houston, Student Services
* Bella Tran, E1119, Austin, Accounting 
* Raj Kumar, E2048, Dallas, Finance 

We can model these columns of data using a Pandas dataframe as follows: 

.. code-block:: python3

  >>> employees = pd.DataFrame(
      {
        'eid' :['E0124', 'E0125','E1119','E2048'],
        'name':['John Doe', 'Luna Lu', 'Bella Tran', 'Raj Kumar'],
        'location':['Austin','Houston', 'Austin', 'Dallas'],
        'department':['ITS','Student Services', 'Accounting','Finance']
      }
    )

Notice that in the above example we construct the DataFrame using a Python dictionary of lists, where 
each key in the dictionary represents a column in our dataset, and the corresponding list contains the 
values for that column. 

Indexing Columns 
^^^^^^^^^^^^^^^^^
We now have several access methods for getting at the data in our DataFrame. For example, we can access 
an individual column using the associated key:

.. code-block:: python3

  >>> employees['name']
    0      John Doe
    1       Luna Lu
    2    Bella Tran
    3     Raj Kumar
    Name: name, dtype: object

This is similar to normal Python dictionary access, but notice that the output contains indexes for the employees
(i.e., the rows) as well. 


Indexing Rows
^^^^^^^^^^^^^
We can access individual rows in the data set using the ``iloc`` function, like so:

.. code-block:: python3

  >>> employees.iloc[1]
    eid                      E0125
    name                   Luna Lu
    location               Houston
    department    Student Services
    Name: 1, dtype: object

.. note:: 

    Using ``iloc`` requires the use of brackets (``[]``), not parenthesis (``()``) as with normal function 
    invocation. 

Be aware that one *cannot* index into the DataFrame using an integer (row) index; it will result in an error:

.. code-block:: python3

  >>> employees[1]
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    File ~/.cache/pypoetry/virtualenvs/risd-course-KKx7_8Y0-py3.11/lib/python3.11/site-packages/pandas/core/indexes/base.py:3791, in Index.get_loc(self, key)
    3790 try:
    -> 3791     return self._engine.get_loc(casted_key)
    3792 except KeyError as err:
    . . . 

This is the same error one would get if one tried to index a normal Python dictionary using 
an integer index (or any other index that didn't exist in the key set).

**Exercise.** What kind of object is a row? How would you find out? 

Attributes of Rows 
^^^^^^^^^^^^^^^^^^

With a given row, we can access a specific column (attribute) using the ``.<attribute>`` notation. 
For example, 

.. code-block:: python3 

    # get row 1 (i.e., the second row)
    >>> row = employees.iloc[1]

    # get the eid of row 1 
    >>> row.eid 
    'E0125'

**Exercise.** What happens if you try to get the ``name`` attribute using this notation? What do
you think might be happening here? 


You can also use the ``.get(<attribute>)`` method. This is useful when the name of a column is not 
a valid Python identifier (e.g., a column such as "Campus Mail Code")

.. code-block:: python3 

    # get the eid of row 1 
    >>> row.get('eid')
    'E0125'


More On the ``iloc`` and ``loc`` Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can use ``iloc`` to select multiple rows and even specific columns for each 
row. The syntax in its general form takes two lists of integers representing the rows and 
columns we want to select, like this: 

.. code-block:: python3

    >>> df.iloc[ [<rows to select>], [<colums to select>] ]

For example: 

.. code-block:: python3

    # select rows 0, 1 and 3 and all columns
    >>> employees.iloc[[0,1,3]]
        eid 	    name 	location    department
    0 	E0124 	John Doe 	Austin 	    ITS
    1 	E0125 	Luna Lu 	Houston     Student Services
    3 	E2048 	Raj Kumar 	Dallas      Finance

And: 

.. code-block:: python3 

    # select rows 1 and 2 and columns 0, 1 and 3
    >>> employees.iloc[[1,2], [0,1,3]]
        eid 	name 	    department
    1 	E0125 	Luna Lu     Student Services
    2 	E1119 	Bella Tran  Accounting    

The ``loc`` function works similarly to ``iloc`` except that it uses integer indexes for the rows and 
string labels for the indexes instead of integers. The general format is like this: 

.. code-block:: python3 

    >>> df.loc[ [<rows (as ints>)], [<columns (as strings)>] ]

For example, 

.. code-block:: python3 

    >>> employees.loc[[0,2], ['department', 'eid']]
 	department  eid
    0 	ITS         E0124
    2 	Accounting  E1119

.. note::

    Remember, the ``i`` is for integer; always use integer indexes with ``iloc`` and 
    string label indexes with ``loc``. 

Filtering Rows with Conditionals 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Another powerful feature of DataFrames is the ability to filter rows using conditional statements. 
We can use a syntax like the following to return a Series object of booleans (i.e., ``True/False`` values) 
where an entry is ``True`` if the associated value from the original DataFrame matches the criterion:

.. code-block:: python3 

    >>> df['<column>'] <conditional>

For example, 

.. code-block:: python3

    >>> employees['location'] == 'Austin'
    0     True
    1    False
    2     True
    3    False
    Name: location, dtype: bool    

A powerful application of this feature is to create a DataFrame of rows matching the criterion. 
The general syntax is as follows: 

.. code-block:: python3

    >>> df[ df['<column>' <conditional>] ]

For example, we can use the equality operator (``==``) to find all employees with a given EID or 
located in a specific city:

.. code-block:: python3 

    # find all employees with eid E1119
    >>> employees[ employees['eid'] == 'E1119']
        eid 	name 	    location 	department
    2 	E1119 	Bella Tran  Austin 	Accounting    

    # find all employees located in Austin 
    >>> employees[ employees['location'] == 'Austin']
     	eid 	name 	    location 	department
    0 	E0124 	John Doe 	Austin 	ITS
    2 	E1119 	Bella Tran 	Austin 	Accounting

Note that this is returning to us an entire DataFrame, i.e., all of the columns associated 
with the rows that match our criterion. 

We can use other operators as well, such as ``>``, ``<``, ``>=``, ``<=``, etc. 
Keep in mind that the meaning of these operations depends on the underlying data type. 

**Exercise.** What does the following return? 

.. code-block:: python3 

    >>> employees[ employees['eid'] > "E0125" ]

The ``astype`` Method and More Complex Conditionals 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We mentioned that when we use the general filter syntax, the result is a pandas Series. 
Sometimes, we might want to apply functions as part of conditional expressions when 
filtering rows. 

For example, we might like to know what employees have EIDs that begin with ``"E0"``. To 
do that, we could write a conditional that utilized the string function ``startswith()``,
but we'll need to tell pandas we want to treat the column values as ``str`` type. We 
do that with the ``astype()`` method. Then, we chain it together with the ``str.startswith()``
condition that we want to filter on. 

Here is an example:

.. code-block:: python3 

    >>> employees [ employees['eid'].astype(str).str.startswith("E0") ]
 	eid 	name 	        location 	department
    0 	E0124 	John Doe 	Austin 	        ITS
    1 	E0125 	Luna Lu 	Houston 	Student Services    


Loading Data From External Files 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We will often be loading data from external files. Pandas makes it easy to create a DataFrame from 
a structured (e.g., sql file) or semi-structure (e.g., CSV) file. Here, we look at loading data from a 
CSV, but there are functions for loading data from many other sources. See the documentation on the ``io``
module for more details [2].

The basics of loading data from an external file are simple -- just use the associated function for the 
type of data you have. For CSV, that function is ``pd.read_csv(</path/to/file.csv>)``. When the function 
is successful, the result will be a Pandas DataFrame. 


DataSets on the Class Repo
^^^^^^^^^^^^^^^^^^^^^^^^^^
To show the ``read_csv()`` function, we'll download a couple of csv files from the class github repository. 
In general, the class github repository is where we will host a number of datasets for the class throughout 
the semester, including the datasets for the first two projects. 

In general, the datasets will be hosted within the ``datasets`` top-level directory, organized by unit. 
You can explore the datasets by navigating to the following URL:

..  note:: 

    Class DataSets URL: https://github.com/joestubbs/coe379L-fa25/tree/main/datasets

As you will see, the ``datasets`` directory is organized into subdirectories for each unit. 

Let's download an employees dataset from the ``unit01`` subdirectory. You can use the "Raw" button 
to get a link to the raw content of any file on GitHub; the domain will be ``https://raw.githubusercontent.com``.

**In-Class Exercise.** Download the ``employees.csv`` file from the class GitHub repository. You can use 
any method you like; for example, use ``wget <URL>`` from the command line. Once you have the file downloaded, 
use the ``read_csv()`` function to load it into a DataFrame.

.. code-block:: bash 

    employees2 = pd.read_csv('employees.csv')


Exploring the CSV and the DataFrame 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's take a closer look at the CSV file and explore the DataFrame object we created from it. 
If we open the CSV file, one of the first things we notice is the header row:

.. code-block:: bash 

    eid,name,location,department,title,campus mail code,Business Card

Pandas automatically used this row to create labels for our DataFrame. We can see that by printing the 
entire dataframe or using the ``.columns`` attribute:

.. code-block:: python3 

    >>> employees2
 	eid 	name            location    department      title               campus mail code    Business Card
    0 	E0124 	John Doe 	Austin 	    ITS 	     Software Developer A4011 	            vCard
    1 	E0125 	Luna Lu 	Houston     Student Services Student Advisor 	G9109 	            vCard
    2 	E1119 	Bella Tran 	Austin 	    Accounting 	    Accountant 	        D6336 	            vCard
    3 	E2048 	Raj Kumar 	Dallas 	    Finance 	    Finance Manager 	C4315 	            vCard
    4 	E2218 	Sally Sims 	Austin 	    Student Services Software Developer G9109 	            vCard
    5 	E4321 	Alonzo Smith    Austin 	    ITS 	    Systems Administrator A4011 	    vCard    


    >>> employees2.columns
    Index(['eid', 'name', 'location', 'department', 'title', 'campus mail code',
       'Business Card'],
      dtype='object')

Notice also that spaces in the header row are copied character-for-character; in the CSV file, there are no spaces
around the column names, i.e., spaces before or after the ``,``. If there were spaces, the dataframe column 
names would also have spaces. 

Issues To Look Out For 
^^^^^^^^^^^^^^^^^^^^^^
When reading data from semi-structured files into dataframe, there are a number potential gotchas to be 
on the lookout for. We mention a few here. 

**Missing Column Headers.** Open the csv file in a file editor and remove the first line. Save the file with 
a different name. The result is a CSV file without column headers. What happens when you read the 
file into a pandas DataFrame? 

.. code-block:: python3

    >>> employees3 = pd.read_csv('employees_no_headers.csv')
 	E0124 	John Doe 	Austin 	ITS 	Software Developer 	A4011 	vCard
    0 	E0125 	Luna Lu 	Houston 	Student Services 	Student Advisor 	G9109 	vCard
    1 	E1119 	Bella Tran 	Austin 	Accounting 	Accountant 	D6336 	vCard    
    . . . 
    >>> employees3.columns 
    Index(['E0124', 'John Doe', 'Austin', 'ITS', 'Software Developer', 'A4011',
       'vCard'],
      dtype='object')

As you can see, the first row was used as the headers! This is obviously **not** what we want. 
Be careful about csv files that do not have column headers. From experience, if you are working 
with such a file, it is perhaps easiest to first edit the file to add a row of headers. 

**Missing Values.** By definition, every row of a DataFrames must have a value for every column. 

For example, the following code gives an error because there are 3 ``eid`` values but 4 values for 
all the other columns. 

.. code-block:: python3 

    >>> employees_bad1 = pd.DataFrame(
      {
        'eid' :['E0124', 'E0125','E1119'],
        'name':['John Doe', 'Luna Lu', 'Bella Tran', 'Raj Kumar'],
        'location':['Austin','Houston', 'Austin', 'Dallas'],
        'department':['ITS','Student Services', 'Accounting','Finance']
      }
    )

    ValueError: All arrays must be of the same length

In this case, the DataFrame simply fails to be created. 

The result is different when trying to load a csv file with a missing value. For example, 
suppose we had a csv file with an EID missing, say in the first row, as depicted below:

.. code-block:: bash 

    # employees_bad.csv 
    eid,name,location,department,title,campus mail code,Business Card
    John Doe,Austin,ITS,Software Developer,A4011,vCard
    E0125,Luna Lu,Houston,Student Services,Student Advisor,G9109,vCard
    E1119,Bella Tran,Austin,Accounting,Accountant,D6336,vCard
    E2048,Raj Kumar,Dallas,Finance,Finance Manager,C4315,vCard
    E2218,Sally Sims,Austin,Student Services,Software Developer,G9109,vCard


Using ``pd.read_csv()`` on this file "works" and produces a DataFrame, though it's not 
what we might expect: 

.. code-block:: python3 

    >>> employees_bad = pd.read_csv('employees_bad.csv')
    >>> employees_bad.iloc[[0, 1, 2]]

.. figure:: ./images/employees-bad-out.png
    :width: 1000px
    :align: center

Something interesting (and not in a good way) has happened... the first row has a value 
of ``NaN`` for the ``Business Card`` column and every other is off by one; for example, 
it has a value of ``John Doe`` for the ``eid`` column. 

A Word on Missing Values and the NaN Value 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The pandas library has multiple ways of representing missing values. We'll discuss dealing with missing 
values more in the next lecture, and we will get practice working with missing values throughout the 
semester. For now, know that the ``NaN`` value showing up in the above DataFrame is the numpy "Nan"
value (i.e., ``np.nan``), and it has some interesting properties. For example, it never "equals" 
any other value when testing with the ``==`` operator. 


**In-Class Exercise.** 

1. Read the ``employees_bad.csv`` file into a DataFrame, and select the NaN value from the 0th row.

2. Confirm that the NaN value from the 0th row is not ``==`` to the numpy ``nan`` value. 

3. Replace the ``==`` operator in step 2 with the ``is`` operator. What do you find? 


.. warning:: 

    The main takeaway at this time is that dealing with missing values is subtle and tricky. 
    Care is required to make sure your DataFrame and the calculations you do with it aren't 
    corrupted in the presence of missing values. 

See the pandas documentation [3] for more about missing data. 

*Solutions:* 

.. code-block:: python3 

  # import numpy 
  >>> import numpy as np 

  # read the bad csv file 
  >>> employees_bad = pd.read_csv('employees_bad.csv')
  
  # grab the "Business Card" column from the 0 row 
  >>> r1_nan = employees_bad.iloc[[0]].get("Business Card")
  
  # confirm it is not == to np.nan 
  >>> r1_nan == np.nan 
  False 


Functions on DataFrames 
~~~~~~~~~~~~~~~~~~~~~~~

There are a number of important functions that we will use throughout the semester. Here 
are a few important ones to know now: 

* ``head()``: returns first 5 rows of the dataset.
* ``tail()``: returns last 5 rows of the dataset.
* ``shape``: returns the number of rows and columns in the dataset.
* ``info()``: returns the datatype of each column in the dataset
* ``count()``: returns the number of rows of each column in the dataset. 
* ``min``: returns minimun value of numeric column specified 
* ``max``:returns maximum value of numeric column specified 
* ``unique``: return unique values for given column
* ``value_counts``: return counts of each value for a given column


**In-Class Exercise.** 

1. Create a pandas DataFrame of used cars data based on the ``datasets/unit01/used_cars_data.csv`` 
   file in the class repo.

2. Print the first 5 and last 5 rows of the data set. 

3. How many rows and how many columns are in the dataset? 

4. Are any columns missing data? If so, which ones? And how many rows are missing for each? 


Remarks
~~~~~~~
1. You can use the built-in ``type()`` function to check that a row of a ``DataFrame`` is in fact 
   a ``Series`` object. 


References and Additional Resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Pandas Documentation (2.2.0). https://pandas.pydata.org/docs/index.html
2. Input/Output: Pandas Documentation (2.2.0). https://pandas.pydata.org/docs/reference/io.html
3. Working with Missing Data: Pandas Documentation (2.2.0). https://pandas.pydata.org/docs/user_guide/missing_data.html