Introduction to Exploratory Data Analysis and Pre-Processing ============================================================= As we will emphasize throughout the semester, having a basic familiarity and understanding of the dataset you are working with will be important for making decisions about what and how to analyze it, and the success of machine learning algorithms will depend in large part on having prepared the data to ensure it has certain qualities. In this unit, we introduce an initial set of data analysis and pre-processing techniques. This is just a starting point -- we will introduce additional topics throughout the semester. By the end of this module, students should be able to: 1. Investigate the shape and types of the variables (i.e., columns) contained within the dataset. 2. Perform type conversion for strings/objects of numerical data using ``apply`` combined with other methods. 3. Understand categorical and ordinal data and how to perform type conversion of categorical data to integers using one-hot encoding. 4. Perform basic duplicate and missing value detection and treatment. 5. Compute basic statistical properties of data, including mean, median, max and min. 6. Apply simple univariate and multivariate analysis of the feature columms including visualization techniques using matplotlib and/or seaborn libraries. In this module, we will illustrate the concepts by working through a dataset hosted on the class GitHub repo. Before getting started, please download the used cars `dataset `_ from the class repository. Step 0: Inspect the File ~~~~~~~~~~~~~~~~~~~~~~~~ Before you write even a line of Python, it is good to just manually inspect the file. We can answer questions like the following without writing a line of code: * How big is the file? * Is it really text? * What do the first few lines look like? For example: .. code-block:: bash # how big is the file? $ ls -lh used_cars_data.csv -rw-rw-r-- 1 jstubbs jstubbs 839K Jan 22 19:32 used_cars_data.csv # it's a 839K file.. not too big # is it really text? $ file used_cars_data.csv used_cars_data.csv: CSV text # yes! # other answers we could have gotten: # Python script, ASCII text executable (Python scripe) # ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked (binary file) # what do the first few lines look like? $ head used_cars_data.csv id,brand,model,model_year,mileage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price 0,Ford,F-150 Lariat,2018,74349,Gasoline,375.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,10-Speed A/T,Blue,Gray,None reported,Yes,11000 1,BMW,335 i,2007,80000,Gasoline,300.0HP 3.0L Straight 6 Cylinder Engine Gasoline Fuel,6-Speed M/T,Black,Black,None reported,Yes,8250 2,Jaguar,XF Luxury,2009,91491,Gasoline,300.0HP 4.2L 8 Cylinder Engine Gasoline Fuel,6-Speed A/T,Purple,Beige,None reported,Yes,15000 3,BMW,X7 xDrive40i,2022,2437,Hybrid,335.0HP 3.0L Straight 6 Cylinder Engine Gasoline/Mild Electric Hybrid,Transmission w/Dual Shift Mode,Gray,Brown,None reported,Yes,63500 4,Pontiac,Firebird Base,2001,111000,Gasoline,200.0HP 3.8L V6 Cylinder Engine Gasoline Fuel,A/T,White,Black,None reported,Yes,7850 5,Acura,Integra LS,2003,124756,Gasoline,140.0HP 1.8L 4 Cylinder Engine Gasoline Fuel,5-Speed M/T,Red,Beige,At least 1 accident or damage reported,Yes,4995 6,Audi,S5 3.0T Prestige,2014,107380,Gasoline,333.0HP 3.0L V6 Cylinder Engine Gasoline Fuel,7-Speed A/T,Gray,Black,None reported,Yes,26500 7,GMC,Acadia SLT-1,2019,51300,Gasoline,193.0HP 2.5L 4 Cylinder Engine Gasoline Fuel,6-Speed A/T,White,Black,At least 1 accident or damage reported,Yes,25500 8,Audi,A3 2.0T Tech Premium,2016,87842,Gasoline,200.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,7-Speed A/T,Silver,Black,None reported,Yes,13999 9,Acura,MDX Technology,2007,152270,Gasoline,300.0HP 3.7L V6 Cylinder Engine Gasoline Fuel,5-Speed A/T,Gray,Beige,At least 1 accident or damage reported,Yes,6700 Now that we know we are really dealing with text and that there appears to be a row of header labels on the first line, let's proceed to using Python. Data Shape and Types ~~~~~~~~~~~~~~~~~~~~ It's important to begin any data exploration by understanding the shape (number of rows and columns) as well as the types (strings, integers, floats, etc) of the values in the dataset. Let's read the used cars dataset from the class repo into a DataFrame and work through some of the first examples. .. code-block:: python3 # import the library and create the DataFrame >>> import pandas as pd >>> cars = pd.read_csv('used_cars_data.csv') # use head to make sure dataset was actually loaded >>> cars.head() .. figure:: ./images/cars_head.png :width: 1000px :align: center We begin by calling ``head()`` to print the first five rows. We also use ``shape`` to get the number of rows and columns .. code-block:: python3 >>> cars.shape (101,13) We see from the output of ``shape`` that there are 101 rows and 13 columns. The output of ``head()`` gives us an idea of the columns. We'll use ``info()`` to get the column types that were inferred: .. code-block:: python3 >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 101 non-null int64 1 brand 101 non-null object 2 model 101 non-null object 3 model_year 101 non-null int64 4 mileage 101 non-null int64 5 fuel_type 94 non-null object 6 engine 101 non-null object 7 transmission 101 non-null object 8 ext_col 101 non-null object 9 int_col 98 non-null object 10 accident 101 non-null object 11 clean_title 101 non-null object 12 price 101 non-null int64 dtypes: int64(4), object(9) memory usage: 10.4+ KB We see a mix of ints and objects (e.g., strings). The column names all look like legitimate header names, though some could be a little mysterious (e.g., "id."). We see that most of the columns have 101 non-null values. There are missing or null values in some columns that would need a separate treatment. A Basic Understanding of the Data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ At this point, we want to step back and see if we have a basic understanding of what is going on with this dataset. If we were given a complete description of the data, this wouldn't be difficult. Often times though, our information about a dataset may be partial and imperfect. For example, it may have been sent to us by the "sales department" or the "data group", and they may or may not have given us a complete explanation of all of the details. Or, we may have found the dataset on the internet, perhaps associated with a published paper, a blog post, or a git repository. Sometimes, we have to do some of our own investigating to figure out what is going on with particular data elements or columns. So let's think about this dataset. Any one have a thought as to what is going on here? This is a dataset about used cars -- their current price as well as number of other features, such as the brand of the car,the model, the year it was made, the fuel and transmission, etc. Dropping Irrelevant Columns ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Let's think about whether we need all of the columns. It's always best to remove "irrelevant" columns whenever possible. What constitute's an "irrelevant" column? What do you think? It depends on the dataset and the question(s) being asked of it! There are plenty of interesting questions we could ask and (try to) answer with this dataset. Today, we're interested in understanding how the current (used) price is related to other features in the dataset. This "id" column looks suspicious. It looks like it might be just an integer index (i.e., the row number). That's virtually never useful because we can always get the row index using functions. But first, let's confirm that it really is just the row index. How might we check that? First, let's just look at the values by printing the column. (Remember: how do we print the column of a DataFrame?) .. code-block:: python3 >>> cars['id'] 0 0 1 1 2 2 3 3 4 4 ... 96 96 97 97 98 98 99 99 100 100 101 rows × 1 columns The output above tells us that the first five rows (rows 0 through 4) and the last five rows all have value for "id" matching the row index. That's pretty good evidence. If we need more evidence here are some other checks: .. code-block:: python3 >>> len(cars['id'].unique()) 101 # the same number as the total number of values, so all values are unique # compare with a numpy array >>> import numpy as np >>> n = np.arange(start=0, stop=101, step=1) >>> cars['id'].sum() == n.sum() True # the same sum, same length, and all unique, so we know they are identical! Let's drop this column. We'll use the ``drop()`` method of the DataFrame, which allows us to remove rows or columns using lables. We do need to specify the axis we want to delete from (``axis=0`` for rows, ``axis=1`` for columns), and we want to set ``inplace=True`` so that it changes the existing DataFrame instead of creating a new one. .. code-block:: python3 >>> cars.drop(['id'], axis=1, inplace=True) # it's always good to confirm >>> cars.shape (101, 12) You can read more about ``drop()`` from the documentation [1]. Type Conversions ~~~~~~~~~~~~~~~~ While most datasets will have a mix of different types of data, including strings and numerics, virtually all of the algorithms we use in class require numeric data. Thus, before we start any machine learning, we'll want to convert all of the columns to numbers. Broadly, there are two cases: * Numeric columns that are strings * Categorical columns that require an "embedding" to some space of numbers. .. Numeric Columns with Strings ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Recall that the ``info()`` function returned the type information for each column: .. code-block:: python3 >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 101 non-null object 1 model 101 non-null object 2 model_year 101 non-null int64 3 mileage 101 non-null int64 4 fuel_type 94 non-null object 5 engine 101 non-null object 6 transmission 101 non-null object 7 ext_col 101 non-null object 8 int_col 98 non-null object 9 accident 101 non-null object 10 clean_title 101 non-null object 11 price 101 non-null int64 dtypes: int64(3), object(9) memory usage: 9.6+ KB We can see that ``engine`` column contains numeric data, for example: 375.0HP 3.5L V6 Cylinder Engine Gasoline Fue but it is represented as a string We need to strip off the units characters and leave only the numeric value. At that point we can cast the value to a float. We need to take some care when attempting to modify all the values in a column. Remember, we've only looked at the first few values. There could be unexpected values later in the dataset. .. warning:: Like in other software engineering, data processing should be done defensively. That is, assume that any kind of value could appear in any part of the dataset until you have proven otherwise. We'll use the regular expression to extract number before the space in ``engine`` column. Recall from the previous module the ``astype()`` function, for casting to a specific python type. .. code-block:: python3 >>> cars['engine'] = cars['engine'].str.extract(r'(\d+(\.\d+)?)')[0].astype(float) Regular Expression ``(r'(\d+(\.\d+)?)')``: ``\d+`` matches one or more digits (i.e., the whole number part of the horsepower). ``(\.\d+)?`` optionally matches the decimal point followed by one or more digits, allowing for float values (like 375.0). After executing the above code, we can then check that the ``engine`` column was indeed converted: .. code-block:: python3 >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 101 non-null object 1 model 101 non-null object 2 model_year 101 non-null int64 3 mileage 101 non-null int64 4 fuel_type 94 non-null object 5 engine 101 non-null float64 6 transmission 101 non-null object 7 ext_col 101 non-null object 8 int_col 98 non-null object 9 accident 101 non-null object 10 clean_title 101 non-null object 11 price 101 non-null int64 dtypes: float64(1), int64(3), object(8) memory usage: 9.6+ KB We can also check several values of the column to see that indeed the string have been removed: .. code-block:: python3 >>> cars["engine"] 0 375.0 1 300.0 2 300.0 3 335.0 4 200.0 ... ... 96 255.0 97 381.0 98 228.0 99 150.0 100 250.0 101 rows × 1 columns dtype: float64 .. warning:: You will not be able to cast the values in a Pandas Series to ``int`` if the column contains missing values. Categorical Values ^^^^^^^^^^^^^^^^^^^ We see that there are many columns in the dataset that are categorical, example ``fuel_type``, ``transmission``, ``int_col``, ``ext_col``, ``accident``, etc. Looking at some of these columns which have type object, we see that the first few objects (``fuel_type``, ``transmission``, and ``accident``) are all non-numeric; that is, the string values are do not contain any numbers. However, it is easy to check the unique values within a column using the ``.unique()`` function; for example: .. code-block:: python3 >>> cars['fuel_type'].unique() array(['Gasoline', 'Hybrid', nan, 'Electric','E85 Flex Fuel', 'Diesel'], dtype=object) For the ``fuel_type`` column, we see there are only 5 different values in the entire DataFrame and some missing values represented as ``nan``. How many values do ``transmission`` and ``accident`` take? .. code-block:: python3 >>> cars['transmission'].unique() array(['10-Speed A/T', '6-Speed M/T', '6-Speed A/T', 'Transmission w/Dual Shift Mode', 'A/T', '5-Speed M/T', '7-Speed A/T', '5-Speed A/T', '8-Speed A/T', 'Transmission Overdrive Switch', '9-Speed Automatic', '7-Speed M/T', '10-Speed Automatic', '6-Speed Automatic', 'M/T', '5-Speed Automatic', 'CVT Transmission', '9-Speed A/T'], dtype=object) >>> cars['accident'].unique() array(['None reported', 'At least 1 accident or damage reported'], dtype=object) These are examples of *categorical* columns: that is, a column that takes only a limited (usually) fixed set of values. We can think of categorical columns as being comprised of labels. Some additional examples: * Cat, Dog * Green, Yellow, Red * Austin, Dallas, Houston * Accountant, Software Developer, Finance Manager, Student Advisor, Systems Administrator * Gold, Silver, Bronze In some cases, there is a natural (total) order relation on the values; for example, we could say: .. math:: Gold > Silver > Bronze These variables are called "ordinal categoricals" or just "ordinal" data. On the other hand, many categorical columns have no natural order -- for example, "Cat" and "Dog" or the position types of employees ("Accountant", "Software Developer", etc.). .. note:: Even in the case of ordinal categoricals, numeric operations (``+``, ``*``, etc) are not possible. This is another way to distinguish categorical data from numerical data. The type of categorical (ordinal or not) dictates which method we will use to convert to numeric data. For categorical data that is not ordinal, we will use a method called "One-Hot Encoding". One-Hot Encoding ^^^^^^^^^^^^^^^^ The "One-Hot Encoding" terminology comes from digital circuits -- the idea is to encode data using a series of bits (1s and 0s) where, for a given value to be encoded, only one bit takes the value 1 and the rest take the value 0. How could we devise such a scheme? Suppose we have the labels "Cat" and "Dog". One approach would be to simply use two bits, say a "Cat" bit and a "Dog" bit. If the "Cat" bit is the left bit and the "Dog" bit is the right one, then we would have a mapping: .. math:: Cat \rightarrow 1 0 Dog \rightarrow 0 1 We could devise a similar scheme for the colors of a traffic light (Green, Yellow, Red) with three bits: .. math:: Green \rightarrow 1 0 0 Yellow \rightarrow 0 1 0 Red \rightarrow 0 0 1 This seems like a pretty good approach, but if we look carefully at the above schemes, we might notice that we never used the "all 0s" bit value. And in fact we could do slightly better: we can actually save one bit by noticing that the last label can be represented as the "absence" of all other labels. For example, .. math:: Cat \rightarrow 1 Dog \rightarrow 0 where we can think of the above as mapping the "Dog" label to "not Cat". Similarly, .. math:: Green \rightarrow 1 0 Yellow \rightarrow 0 1 Red \rightarrow 0 0 where we have mapped "Red" to "not Green, not Yellow". In general, a One-Hot Encoding scheme needs a total number of bits that is 1 less than the total possible values in the data set. We can use this technique to expand a categorical column into :math:`n-1` columns of bits (0s and 1s) where :math:`n` is the number of possible values in the column. First, we need to cast the column values to the type ``category``, a special pandas type for categorical data, using the ``astype()`` function. Before, we can perform one hot encoding on the ``fuel_type`` we need to handle the missing values for that column. Missing Values ~~~~~~~~~~~~~~ Let's return to the issue of missing values. We saw previously that the ``info()`` function that several rows had missing values. We could tell this from the columns with non-null totals less than the total number of rows in the DataFrame: .. code-block:: python3 >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 101 non-null object 1 model 101 non-null object 2 model_year 101 non-null int64 3 mileage 101 non-null int64 4 fuel_type 94 non-null object 5 engine 101 non-null float64 6 transmission 101 non-null object 7 ext_col 101 non-null object 8 int_col 98 non-null object 9 accident 101 non-null object 10 clean_title 101 non-null object 11 price 101 non-null int64 dtypes: float64(1), int64(3), object(8) memory usage: 9.6+ KB Another way to check for nulls is to use the ``isnull()`` method together with ``sum()``: .. code-block:: python3 >>> cars.isnull().sum() brand 0 model 0 model_year 0 mileage 0 fuel_type 7 engine 0 transmission 0 ext_col 0 int_col 3 accident 0 clean_title 0 price 0 dtype: int64 Strategies for Missing Values ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ There are many ways to deal with missing values, referred to as *imputation* (to *impute* something means to represent it, and, in the context of data science, to *impute* a missing value is to fill it in using some method). We will cover just the basics here. **Removing Rows with Missing Data.** The simplest approach is to just remove rows with missing data from the dataset. However, from a machine learning perspective, this approach discards potentially valuable data. Usually, we will want to avoid this strategy. **Univariate Imputation.** In this approach, we use only information about the column (or "variable") to fill in the missing values. Some examples include: 1. Fill in all missing values with a statistical mean 2. Fill in all missing values with a statistical median 3. Fill in all missing values with the most frequent value 4. Fill in all missing values with some other constant value **Multivariate Imputation.** With multivariate imputation, the algorithm uses all columns in the dataset to determine how to fill in the missing values. For example: 1. Fill in the missing value with the average of the $k$ nearest values, for some definition of "nearest" (requires providing a metric on the data elements -- we'll discuss this more in Unit 2). 2. Iterative Imputation -- this method involves iteratively defining a function to predict the missing values based on values in other rows and columns. Replacing Missing Values with Pandas ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is a simple approach of filling the missing values with the mean. The ``fillna()`` function works on a Series or DataFrame and takes the following arguments: * The value to use to fill in the missing values with. * An optional ``inplace=True`` argument. You can read more about the function in the documentation [2]. Fill in the missing values for the ``fuel_type`` column using Pandas ``groupby`` The ``groupby`` function is a powerful method for grouping together rows in a DataFrame that have the same value for a column. Its most simplistic form looks like this: .. code-block:: python3 >>> df.groupby(['']).*additional_functions()* What constitutes "similar"? There are many ways we could try to define it. In this case, we'll say that two cars are "similar" if they have the same values for some of the features. For example, we could say two cars are similar if they have the same model. We can also use ``groupby`` to group rows by multiple columns -- we simply list additional column names, like so: .. code-block:: python3 >>> df.groupby(['>> cars.isnull().sum() brand 0 model 0 model_year 0 mileage 0 fuel_type 0 engine 0 transmission 0 ext_col 0 int_col 3 accident 0 clean_title 0 price 0 Here's how that looks for the ``fuel_Type`` column. First, we do the cast: .. code-block:: python3 >>> cars['fuel_type'] = cars['fuel_type'].astype("category") Using ``info()`` we see the column was converted: .. code-block:: python3 >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 101 non-null object 1 model 101 non-null object 2 model_year 101 non-null int64 3 mileage 101 non-null int64 4 fuel_type 101 non-null category 5 engine 101 non-null float64 6 transmission 101 non-null object 7 ext_col 101 non-null object 8 int_col 98 non-null object 9 accident 101 non-null object 10 clean_title 101 non-null object 11 price 101 non-null int64 dtypes: category(1), float64(1), int64(3), object(7) memory usage: 9.1+ KB We can get rid of missing values in ``int_col`` in a simpler way as this is not a very important column of the dataset: .. code-block:: python3 cars['int_col'].fillna('black', inplace=True) .. code-block:: python3 >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 101 non-null object 1 model 101 non-null object 2 model_year 101 non-null int64 3 mileage 101 non-null int64 4 fuel_type 101 non-null category 5 engine 101 non-null float64 6 transmission 101 non-null object 7 ext_col 101 non-null object 8 int_col 101 non-null object 9 accident 101 non-null object 10 clean_title 101 non-null object 11 price 101 non-null int64 dtypes: category(1), float64(1), int64(3), object(7) memory usage: 9.1+ KB All the missing values have now been taken care of. We will now use the ``pandas.get_dummies()`` function to convert the categorical columns to a set of bit columns. Notes on the ``get_dummies()`` function: * It lives in the global pandas module space -- reference it as ``pd.get_dummies()`` * It takes a DataFrame as input. * It takes a ``columns=[]`` argument, which should be a list of column names to apply the encoding to. * It can optionally take a ``drop_first=True`` argument, in which case it will produce ``n-1`` columns for each categorical column, where ``n`` is the number of distinct values in the categorical column. We will do the one-hot encoding only for one colum ``fuel_type``. .. code-block:: python3 :emphasize-lines: 19-24 >>> cars = pd.get_dummies(cars, columns=["fuel_type"], drop_first=True) >>> cars.info() RangeIndex: 101 entries, 0 to 100 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 101 non-null object 1 model 101 non-null object 2 model_year 101 non-null int64 3 mileage 101 non-null int64 4 engine 101 non-null float64 5 transmission 101 non-null object 6 ext_col 101 non-null object 7 int_col 101 non-null object 8 accident 101 non-null object 9 clean_title 101 non-null object 10 price 101 non-null int64 11 fuel_type_E85 Flex Fuel 101 non-null bool 12 fuel_type_Electric 101 non-null bool 13 fuel_type_Gasoline 101 non-null bool 14 fuel_type_Hybrid 101 non-null bool Notice that it automatically removed the categorical columns of type object and replaced each of them with :math:`n-1` new ``bool`` columns. It used the values of the object column in the names of the new boolean columns. .. note:: When we introduce Scikit-Learn, we'll learn a different function for converting categorical data using One-Hot Encoding. .. note:: The use of "dummies" in the function name ``get_dummies`` comes from statistics and related fields that refer to the columns of a one-hot encoding as "dummy" variables (or "dummy" columns). Duplicate Rows ^^^^^^^^^^^^^^ We can check for and remove duplicate rows. In most machine learning applications, it is desirable to remove duplicate rows because additional versions of the exact same row will not "teach" the algorithm anything new. (This will make more sense after we introduce machine learning). Pandas makes it very easy to check for and remove duplicate rows. First, the ``duplicated()`` function of a DataFrame returns a Series of booleans where a row in the Series has value ``True`` if that corresponding row in the original DataFrame was a duplicate: .. code-block:: python3 >>> cars.duplicated() # returns boolean Series with true if row is a duplicate 0 False 1 False 2 False 3 False 4 False ... Then, we can chain the ``sum()`` function to add up all ``True`` values in the Series. .. code-block:: python3 >>> cars.duplicated().sum() 0 This tells us there are no duplicated row. If there were duplicate rows you can call to ``drop_duplicates()``. Here are some important parameters to ``drop_duplicates``: * Pass ``inplace=True`` to change the DataFrame itself. * Pass ``ignore_index=True`` to ensure the resulting DataFrame is reindexed :math:`0, 1, ..., n`, where *n* is the length of the resulting DataFrame after dropping all duplicate rows. .. code-block:: python3 >>> cars.drop_duplicates(inplace=True, ignore_index=True) Univariate Analysis ~~~~~~~~~~~~~~~~~~~ In univariate analysis we explore each column of variable of the dataset independently with the purpose of understanding how the values are distributed. It also helps identify potential problems with the data, such as impossible values, as well as to identify *outliers*, or values that differ significantly from all other values. A first step to performing univariate analysis is to compute some basic statistics of the variables. Pandas provides a convenient function, ``describe()``, for computing statistical values of all numeric types in a DataFrame. .. code-block:: python3 >>> cars.describe() .. figure:: ./images/cars-describe.png :width: 1000px :align: center The output shows a number of statistics for each column, including: * count: Total number of values for the column. * mean: Average of values for the column. * std: The standard deviation of values for the column. This is one way to measure the amount of variation in a variable. The larger the standard deviation, the greater the amount of variation. * min: The minimum value of all values for the column. * max: The maximum value of all values for the column. * 25%, 50%, 75%: The percentile, i.e., the value below which the given percentage of values fall, approximately. For example, the output above indicates that approximately 25% of cars were created during or before the year 2012. This information helps us to see how the values of a particular column are distributed. For example, the data indicate that: We can also use graphical tools for this purpose. Matplotlib and Seaborn ^^^^^^^^^^^^^^^^^^^^^^ We recommend two libraries -- ``matplotlib`` and ``seaborn`` -- for generating data visualizations. Roughly speaking, you can think of ``matplotlib`` as the lower level library, providing more controls at the expense of a more complicated API. On the other hand, ``seaborn`` provides a relatively simple (by comparison to ``matplotlib``), high-level API for common statistical plots. In fact, ``seaborn`` is built on top of ``matplotlib``, so it is fair to think of it as a high-level wrapper. It also integrates closely with pandas. In this lecture, we'll use a few plots from the ``seaborn`` and one from ``matplotlib``. Installing matplotlib and seaborn can be done with pip: .. code-block:: console [container/virtualenv]$ pip install matplotlib [container/virtualenv]$ pip install seaborn Both are already installed in the class container. Once installed, seaborn is typically imported as follows: .. code-block:: python3 >>> import seaborn as sns The most commonly used matplotlib utilities are under the pyplot module, usually imported like so: .. code-block:: python3 >>> import matplotlib.pyplot as plt Histograms ^^^^^^^^^^ The first plot we'll look at is the *histogram*, provided by the ``histplot`` function. In its simplest form, we tell it what data to plot. For example, we can have it plot the ``Year`` Series of the cars DataFrame: .. code-block:: python3 >>> sns.histplot(data=cars['model_year'] ) .. figure:: ./images/hist-cars-year-1.png :width: 1000px :align: center As we see, the histogram plots the counts of each value for the dataset. From this single line of code, we can already see that the distribution of cars is weighted heavily towards the recent years. We can also use a different number of bins with ``histplot``. The following code uses 5 bin. .. code-block:: python3 >>> sns.histplot(data=cars['model_year'], bins=5) .. figure:: ./images/hist-cars-year-2.png :width: 1000px :align: center If we look at the ``price`` column, the histogram reveals a very skewed distribution with a very long tail: .. code-block:: python3 >>> sns.histplot(data=cars['price'], kde=True) .. figure:: ./images/hist-cars-new-price.png :width: 1000px :align: center Count Plots ^^^^^^^^^^^^ Count plots are the second type of useful plot we will introduce. Count plots are used for categorical data in the same way that histograms are for numeric data. .. code-block:: python3 >>> sns.countplot(x=cars['brand']) >>> plt.xticks(rotation=45, ha='right') >>> plt.show .. figure:: ./images/count-plot-brand.png :width: 1000px :align: center Note that without rotation, the labels bunch together and become illegible. We can immediately see that more cars of brand Ford were in the dataset. Many aspects of the generated plot are configurable. We won't cover most of the configurations, but we do point out that configurations available on the ``matplotlib.pyplot`` object can be used directly on a seaborn plot. For example, rotating the labels for an axis: Box Plots ^^^^^^^^^^ Box plots (also "boxplots" or "box and whisker plots") are an effective way to quickly visualize the distribtion of data and look for outliers. Box plots depict *quartiles*, which are the quarterly percentiles (i.e., 25th percentile, 50th percentile, 75th percentile) of the data. Here is an example, labeled box plot: .. figure:: ./images/box-plot-generic.png :width: 1000px :align: center Here are some key points about the box plot: * The median is the median (or centre point), also called second quartile, of the data (resulting from the fact that the data is ordered). * Q1 is the first quartile of the data, i.e., 25% of the data lies between minimum and Q1. * Q3 is the third quartile of the data, i.e., 75% of the data lies between minimum and Q3. * The distance betwen Q1 and Q3 is called the Inter-Quartile Range or IQR. In the example above, the median is well-centered within the IQR of the dataset. * The values labled "Minimum" and "Maximum" are aslo called "whiskers" and are computed as: (Q1 - 1.5 * IQR) for the Minimum (or "lower whisker") and (Q3 + 1.5 * IQR) for the Maximum (or "upper whisker"). * Values to the less than the lower whisker or greater than the upper whisker are depicted as circles. In some cases, it might make sense to label these values as "outliers" and drop them from the dataset, but this approach could also result in loss of relevant observations. The seaborn library makes it easy to create box plots with the ``sns.boxplot()`` function. In the simplest form, we pass a DataFrame as the ``data`` parameter and the name of a column to plot, as a string, as the ``x`` parameter: .. code-block:: python3 >>> sns.boxplot(data=cars, x='model_year') .. figure:: ./images/box-plot-year.png :width: 1000px :align: center This plot depicts some observations we have made previously, such as: * The dataset is skewed towards older cars. * We see some outliers in the dataset. Let's plot the ``Price`` column: .. code-block:: python3 >>> sns.boxplot(data=cars, x='price') .. figure:: ./images/box-plot-price.png :width: 1000px :align: center What conclusions do you draw from this plot? Multivariate Analysis ~~~~~~~~~~~~~~~~~~~~~~ By contrast, multivariate analysis explores the relationships across multiple variables. Like univariate analysis, it also helps identify potential problems with the data, such as impossible values, as well as to identify outliers. Heat Maps ^^^^^^^^^ Fow now we'll introduce just one more plot, the heat map, which is a type of *bivariate* analysis (bivariate meaning two variables). A heat map is an excellent way to visualize the extent to which pairs of variables are corrolated. We'll use matplotlib to create a heatmap of the several of the variables in our DataFrame. This is done using the ``sns.heatmap`` function, supplying a set of columns of a DataFrame. We also show a few additional parameters: * ``annot=True``: Annotate the boxes with numberic corrolation values * ``vmin`` and ``vmax``: Adjusts the color range * ``fmt``: Adjusts the formatting of numeric values (``.2f`` means 2 decimal places). See the Python string formatting rules for more details [3]. * ``cmap``: The scheme used for mapping values to colors. .. code-block:: python3 # columns to corrolate corr_cols=['mileage','price','engine'] # increate the figure size plt.figure(figsize=(15, 7)) # the actual heat map sns.heatmap( cars[corr_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral" ) # show the plot plt.show() .. figure:: ./images/cars-heatmap-1.png :width: 1000px :align: center Remember that corrolation values closer to 1 mean two variables are more corrolated while corrolation values closer to 0 mean two variables are less corrolated (with values closer to -1 meaning more negatively or oppositely corrolated). What are some observations we can make based on the heat map? 1. We can see that there is a negaive correlation between ``price`` and ``mileage``. More mileage on car the price will be lower. 2. There is a positive correlation between ``engine`` and ``price``, which is obvious. .. note:: If you are running version 0.12.x of seaborn, you may numeric values along only the top row, due to a bug in seaborn. Updating to 0.13.x fixes the issue. References and Additional Resources ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. pandas.DataFrame.drop: Documentation (2.2.0). https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html 2. pandas.DataFrame.fillna: Documentation (2.2.0). https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html 3. Python String Format Specification. https://docs.python.org/3/library/string.html#formatspec