Classification: A First Look
============================

In this module we introduce the classification problem, work with some initial examples, and 
describe the training and testing phases involved in creating a classifier. We also 
take a first look at validation and overfitting. 

By the end of this module, students should be able to:

1. Describe the classification problem. 
2. Describe at a high level the linear classifier and the perceptron algorithm.
3. Describe the training and testing phases.
4. Plot and interpret a decision boundary and decision matrix to understand model performance. 

Introduction
------------

Classification is the problem of predicting an independent variable that takes values from a finite, discrete set. 
An ML model that predicts such a dependent variable is called a *classifier*. The possible values are called 
*classes*.

An important special case is a *binary classifier* which predicts a dependent variable which takes on just two 
possible values. That is, a binary classifier is a model which predicts two classes for the dependent variable.

For example, given a person's income and credit card history, a bank might want to predict if they will default.
The dependent variable here is whether or not the individual will default. There are two possible values:

* Will default
* Will not default

These are the two classes of the dependent variable, and an ML model that predicts this variable will thus be called
a binary classifier. 

Consider the following dataset depicting annual incomes and monthly credit card balances for a set of individuals. 

.. figure:: ./images/class_income_v_balance.png
    :width: 1000px
    :align: center
    :alt: Income vs Balance

    Sample data showing annual incomes and monthly credit card balances for a set of individuals. 
    The individuals who defaulted on their credit card payments are shown in orange, and those 
    who did not are shown in blue. Citation: [1].


We can use this data to build an ML model for predicting whether an individual will default. The process is 
as follows:

1. Collect and prepare historical data about income, balance and default status.
2. Train a model using some of the prepared data. 
3. Validate the model using some of the prepared data. 
4. Deploy the model to predict the default status for new individuals. 

How might we train a classifier? There are many techniques. The first one we will look at is called *linear 
classification*. Before we look at some of the details of linear classification, does anyone have an 
intuition about how one might proceed to determine which individuals will default?

Is the image above suggestive of a way to predict the default status? 


.. figure:: ./images/class_income_v_balance_decision_boundary.png 
    :width: 1000px
    :align: center
    :alt: Income vs Balance

    A linear decision boundary. Data points are classified based on which side of the line they fall. 

One approach is to use a linear equation (i.e., a line) to determine which class a data point belongs 
to. In the picture above we have drawn one possible line. Points on the left side of the line are 
classified as "will default" and points on the right are classified as "will not default". 

Linear Classification
---------------------

One approach to training classifiers makes use of a *decision function*. As we did in the previous module, 
we denote the independent variable with :math:`X` and the dependent variable with :math:`Y`. 

Given a class :math:`C` of the dependent variable, a 
*decision function*, :math:`f(X)`, is a function in the independent variable that can be used to 
determine if a point is in :math:`C`, as follows:

.. math:: 
    f(x) > 0  \iff x \in C

    f(x) <= 0 \iff x\not\in C

We define the set :math:`\{ x : f(x) = 0 \}` to be the *decision boundary* associated with :math:`f`. 

As the name suggests, with linear classification we define :math:`f(X)` using a linear function; that is, 
:math:`f(X)=mX + b` for some choice of :math:`m` and :math:`b`. 


In this case, the decision boundary is linear, :math:`\{ x : mX + b= 0 \}`. An example of a linear 
decision boundary was drawn in the income versus credit card balance plot above. 

How do we go about finding the optimal choice of :math:`m` and :math:`b`? Here, there are many techniques.
How well the different techniques perform depends in part on the data. 

For example, some techniques work better if the data are *linearly separable*, that is, if there 
exists a linear decision function that separates all data points that are in class :math:`C` from all 
points not in :math:`C`.

Perceptron Algorithm 
~~~~~~~~~~~~~~~~~~~~
We'll begin by looking at the Perceptron Algorithm which was originally conceived of by Frank Rosenblatt 
in his 1956 PhD thesis [2]. It can be slow to train, 
but it can be proven mathematically to always find a correct linear classifier when the data are 
linearly separable. 

Let us try to describe the Percetron Algorithm at a high level. 
Let :math:`X` denote the input features. In general, :math:`X=[x_1, ..., x_n]` live in an :math:`n`-dimensional
space; i.e., consists of :math:`n`-dimensional feature vectors.  The Perceptron Algorithm learns 
a set of *weights*, :math:`W=[w_1, ..., w_n]` as well as a *bias* term, :math:`b`, and defines a decision
function, as follows:

.. math:: 

  f(X) = W^T X + b 

To classify a point :math:`x\in X`, we check the sign of :math:`f(x)` --- if :math:`f(x)< 0` we classify 
:math:`x` as Class 1, and if :math:`f(x)>0` we classify as :math:`x` as Class 2.

The idea behind the algorithm is to iteratively compute the function :math:`f` on the inputs and 
update the values of :math:`W` based on two simple rules:

**Rule 1.** For a given input, :math:`x_i`, if :math:`f(x_i)` correctly assigns the class label, then do not change :math:`W`.

**Rule 2.** If :math:`f(x_i)` incorrectly assigns the class label, then update the :math:`W` 
as follows: :math:`w_i := w_i + α∗y ∗ x_i`. Here, :math:`y` is the correct class label (i.e., -1 or 1)
and :math:`\alpha` is a term called the *learning rate*. 

**Theorem.** (Perceptron Convergence Theorem) If the input data is linearly separable, then the 
algorithm described above learns a "correct" :math:`W` in a finite set of steps. In other words, it 
learns a function, :math:`f(X)`, that correctly classifies all points in the inputs space after 
a finite set of updates. 

We will not try to prove this theorem at this time, but we could give a proof in a future lecture. 
Intuitively, since the dataset is finite and the data are linearly
separable, there is some minimum, positive distance between the data points and the separating 
hyperplane. One approach to building a proof is to bound the total number of updates above and below and show 
that these two bounds converge. 


Linear Classification with Scikit Learn
---------------------------------------
Next we look at implementing a linear classifier using the ``sklearn`` package. In this first example, we'll 
illustrate the techniques on a classic dataset that describes iris flowers. We'll also introduce
helper functions for splitting data into a training data and testing data and computing the accuracy of our
trained models. 

First, let us begin with a description of our dataset. The Iris Flower Dataset or Fisher's Iris Dataset was published
in a paper by British biologist Ronald Fisher in the paper, *The use of multiple measurements in taxonomic 
problems* (see [3]). The dataset includes four features for 150 samples of three species of 
iris: setosa, virginica and versicolor. The features are: sepal length, sepal width, petal length, petal width,
all measured in cm.

Loading the Data
~~~~~~~~~~~~~~~~
First, let us begin by loading the dataset. We'll use a Jupyter notebook for this portion since we will want 
to make use of some visualization. 

The ``sklearn`` package provides a convenience method for loading several classical datasets, including the Iris 
Flower Dataset:

.. code-block:: python3 

  from sklearn import datasets

  iris = datasets.load_iris()


As mentioned, this dataset contains 4 features for 150 samples of three different species of iris. 
Like all ``datasets`` objects from ``sklearn``, the ``iris`` object contains a ``data`` attribute 
holding the independent variables as well as a ``target`` attribute containing the dependent variable for
each sample. Each attribute as a ``numpy.ndarray``. There are also attributes ``features_names`` and ``target_names``
which contain the names of the independent and dependent variables, respectively. 

We can explore the dataset with Python: 

.. code-block:: python3

  >>> iris.feature_names
  ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)']

  >>> iris.data 
  array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       . . . 

  >>> type(iris.data)
  <class 'numpy.ndarray'>

  >>> iris.target_names
  array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

  >>> iris.target
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
  >>> type(iris.target)
  numpy.ndarray


Notice that the features are encoded as floats and that ``iris.data`` is a 2d-array of shape 150x4.
Similarly, the target classes are encoded with integers (0, 1, and 2) for the 3 different species, and 
that ``iris.target`` is a 1d-array of shape 150x1.  

To simplify our initial discussion, we are going to consider the subset of the data consisting of all 
samples in the first two classes (0 and 1), and we will also only consider the petal length and petal width 
features (columns 3 and 4). Notice that the first 100 data points belong to classes 0 and 1 (the last 50 
belong to class 2), so we can construct our dataset as follows:

.. code-block:: python3

    # only use the first 100 rows and the last two columns 
    X = iris.data[0:100,2:4]

    # only use first 100 rows
    y = iris.target[0:100]

Note that we have organized the data into the objects ``X`` and ``y`` for the independent and dependent variables,
respectively. This is a common convention we will use throughout the course. 

Training the Model
~~~~~~~~~~~~~~~~~~
Let us take a moment to recall the general strategy for working with ML models. 

1. Collect and prepare data with labels.
2. Train a model using some of the prepared data. 
3. Validate the model using some of the prepared data. 
4. Deploy the model to predict the default status for new individuals. 

We have completed step 1 for the iris dataset --  we are ready to move to step 2. 

We need to use some of the data for training and reserve some for testing how well the trained model 
performs on *data it hasn't seen*. This is a very important aspect of machine learning. You may 
read on the internet that there are ways to use all of the data for training, but don't be fooled by them!
If you use data the model has already seen in training to test it, you are undermining the integrity of 
the test.

In general, we'll want to train the model using "most" of the data and only hold back a relatively 
small amount to use as validation. What is "most" and how do we decide what to hold back for testing? 
There are a lot of aspects to this question, and we will revisit the topic throughout the semester, 
but for now, we'll split the data using 70% for training and 30% for testing. 

We'll also use a "stratification" technique to ensure (as much as possible) that the split preserves 
proportions of the target class. Fortunately, sklearn has a function to do the work for us:


The ``train_test_split`` function (cf., [4]) from ``sklearn`` is very helpful here:

.. code-block:: python3 

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)


In the code above, we are collecting 4 new objects: ``X_train``, ``X_test``, ``y_train``, ``y_test`` representing
a splitting of the ``X`` and ``y`` data. The ``0.3`` specifies that we want 30% of the data to be used for test 
data and 70% to be used for training. 

Next, we specify ``stratify=y``. This is a very important parameter. Conceptually, it instructs sklearn to 
split the data in a way that preserves the frequency of occurrence of different target classes. In our case, we 
have an equal number of samples for each target (50 each), so a random splitting is likely fine. But in general,
using a stratified split will ensure a proportional splitting even when the samples are imbalanced. 

Finally, we specify ``random_state=1``. This controls the randomization that is used in a way that guarantees 
deterministic results. That is, when setting a value for ``random_state``, repreated calls to ``train_test_split``
will always result in the same splitting for the same input data. This has important consequences for 
reproducibility, a topic we will revisit throughout the semester. 

Having split the data, we are ready to train our model. We'll use the  ``SGDClassifier`` class from the 
``sklearn.linear_model`` module. The "SGD" stands for "Stochastic Gradient Descent" and the ``SGDClassifier``
provides a family of models based on an associated family of Gradient Descent algorithms.

In the code below, we first instantiate the SGDClassifier object, specifying some configurations. Then we actually 
perform the model training using the ``fit`` function. 

Naturally, we use the training data when calling ``fit``:

.. code-block:: python3

    from sklearn.linear_model import SGDClassifier

    # the alpha is multiplied by the regularization term, which can impact overfitting vs underfitting, 
    # something we haven't discussed yet, but just note that a higher value of alpha more likely
    # to underfit. Can try experimenting with different values of alpha (e.g., alpha=0.10) 
    # if the model doesn't achieve 100% accuracy. 
    clf = SGDClassifier(loss="perceptron", alpha=0.05, random_state=1)

    clf.fit(X_train, y_train)


Note that we specify ``loss="perceptron"`` to indicate we want to use the Perceptron algorithm, as discussed 
in class previously. The ``SGDClassifier`` supports several other algorithms (e.g., "hinge", "squared_hinge", 
"log_loss", etc.). 

The ``alpha`` parameter deals with something called regularization, which we haven't discussed yet -- ignore it 
for now. 

The ``clf`` object is the trained model, and it can be used to predict the species of iris samples using the 
``clf.predict()`` method`. 

Validation 
~~~~~~~~~~
Now that the model has been trained we can proceed to step 3 -- validation. Our goal here is to compute the 
accuracy of our model against the test dataset (i.e., the ``test_*`` data objects above). We'll also compute 
the accuracy of the model against the training data to see how they compare. 

For validation, we'll make use of another helpful function: the ``accuracy_score`` from the ``sklearn.metrics``
module. The basic usage is straightforward:

.. code-block:: python3

    from sklearn.metrics import accuracy_score

    # Check the accuracy on the test data
    accuracy_test=accuracy_score(y_test, clf.predict(X_test))

    # Check accuracy on the training data
    accuracy_train=accuracy_score(y_train, clf.predict(X_train))

As suggested by the code above, the ``accuracy_score`` function takes two parameters: the target (dependent)
variables (i.e., the predictions) and the independent variables. Our dependent variables are just the 
``y_test`` and ``y_train`` objects defined before, and for the preduction, we apply the ``clf.predict`` function 
to each of the ``X_test`` and ``X_train`` arrays, respectively. 

The result returned by ``accuracy_score`` is simply a float from 0 to 1 containing the fraction of correctly 
classified samples. 

How did our model do?


.. code-block:: python3

  >>> accuracy_train 
  1.0

  >>> accuracy_test
  1.0


In fact, our model was perfect on both the test and training data! One way to understand this is to visualize
the data -- the Iris dataset is linearly separable, as we will see. 

Additional Properties of the Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``clf.classes_``: These are the possible target class values the model is trying to predict. 
* ``clf.decision_function()``: This function computes the actual decision value for a given `X`
  that is used by the ``predict()`` function. Note that it requires an array of the same shape as the 
  data on which it was trained. 
* ``clf.coef_``: The coefficients learned. Note that when the target (dependent variable) is 1-dimensional, 
  as in the case above, the ``coef_`` attribute will be a 1-D array of length equal to the number of 
  features. 
* ``clf.intercept_``: The y-intercept learned. Together with ``clf.coef_``, this determines the 
  ``clf.decision_function``.

Examples: 

.. code-block:: python3 

    >>> clf.classes_
    array([0, 1])

    >>> clf.coef_
    array([[0.71421357, 0.25331713]])

    >>> clf.intercept_
    array([-2.36966222])

    # consider one data point; it's a 1-D array with two values:
    >>> X_train[0]
    array([1.5, 0.2])

    # apply the decision_function to a single value (note the shape of the input):
    >>> clf.decision_function([X_train[0]])
    array([-1.24767844])

    # this is the same as computing the linear combination of the coef_ and intercept_:
    >>> np.sum( clf.coef_*X_train[0]) + clf.intercept_
    array([-1.24767844])

    >>> clf.decision_function(X_train)
    array([-1.24767844,  1.22427453, -1.22234673, -1.3190998 , -1.39052115,
       -1.55869558,  1.24503246, -1.24767844,  1.51453374, -1.19701502,
       . . .

    # note that class predictions agree with the associated sign (positive or negative) of 
    # the decision_function above
    >>> clf.predict(X_train)
    array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 1, 0])


Visualizing the Decision Boundary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We'll use the ``DecisionBoundaryDisplay`` class from the ``sklearn.inspection`` in conjunction with ``matplotlib``
to create a visualization of the decision boundary.

Note that this technique only works in 2 dimensions, which is why we artificially 
restricted our dataset to two independent variables. 


.. code-block:: python3

    import matplotlib.pyplot as plt
    %matplotlib inline

    from sklearn.inspection import DecisionBoundaryDisplay

    # get current axis (gca) or create new ones if none exist. 
    ax = plt.gca()

    # use the DecisionBoundaryDisplay
    DecisionBoundaryDisplay.from_estimator(
        clf, # the trained model
        X,   # the independent variables -- must be 2D!!
        cmap=plt.cm.Paired, # the color map 
        ax=ax,  # the axis 
        response_method="predict",  # the prediction method 
        xlabel="petal length (cm)",  # lables 
        ylabel="petal width (cm)",
    )

The above code draws the decision boundary. We also plot the dataset using the following code: 

.. code-block:: python3

    import numpy as np 
    # we use two colors because there are two target classes ('setosa', 'versicolor')
    colors = "br"
    # Plot also the training points:
    # iterate over each of the classes (and colors) and make a plot 
    for i, color in zip(clf.classes_, colors):
        # pick out the indexes where the dependent var equals i 
        idx = np.where(y == i)
        plt.scatter(
            X[idx, 0],
            X[idx, 1],
            c=color,
            cmap=plt.cm.Paired,
            edgecolor="black",
            s=20,
        )
    plt.title("Linear Decision Boundary for Linearly Separable Iris Data")
    # Set limits just large enough to show all data, then disable further autoscaling.
    plt.axis("tight")

The result should look similar to the following: 

.. figure:: ./images/Iris_linear_decision_boundary.png 
    :width: 1000px
    :align: center
    :alt: Income vs Balance

    Resulting plot of the linear decision boundary for the Iris dataset. 

Training on the Full Dataset 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's go back and train on the full dataset with all of the features. 

How should we modify the code above? Implement the 
following high-level steps:

1. Create `X` and `y` variables pointing to your independent and dependent variables, respectively.
2. Split the data into training and test. 
3. Train the model 
4. Check the accuracy on the training and test data. 

How does the accuracy compare with the previous version? 

*Solution:*

.. code-block:: python3 

    # We want to use the entire dataset, so we set X and y differently: 
    X = iris.data
    y = iris.target

    # The rest is the same:
    # first, we split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
    # next we 
    clf = SGDClassifier(loss="perceptron", alpha=0.01)
    clf.fit(X_train, y_train)

    # Check the accuracy on the test data
    accuracy_test=accuracy_score(y_test, clf.predict(X_test))
    # Check accuracy on the training data
    accuracy_train=accuracy_score(y_train, clf.predict(X_train))
    print(f"Train accuracy: {accuracy_train}; Test accuracy: {accuracy_test}")


Visualizing the Confusion Matrix 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A confusion matrix is a useful tool for understanding the performance of a model beyond 
just the accuracy rate. 

A confusion matrix compares the predicated label of a model against the actual label 
for all values in the target class. It can be used to quickly target specific classes that the 
model might be performing better or worse on. 

We can use the ``ConfusionMatrixDisplay.from_estimator()`` function to easily plot a confusion
matrix for a model we have fit. See the sample code below: 

.. code-block:: python3

    from sklearn.metrics import ConfusionMatrixDisplay
    cm_display = ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test,
                                                   cmap=plt.cm.Blues,normalize=None)

.. figure:: ./images/confusion_matrix_iris.png
    :width: 1000px
    :align: center

The confusion matrix above shows that our model did well predicting the Setosa (label 0) 
and the Virginica (label 2) flower types, but "confused" the Versicolor (label 1) for the 
Setosa two times. 


A Real-World Application: Classifying Spam Email
-------------------------------------------------
In this section, we look at developing a linear classifier for spam 
email detection. We'll use the "Spambase" dataset, a classic dataset 
hosted on the UC Irvine Machine Learning Repository [5].

The dataset consists of metadata about a set of emails. A total of 58 attributes 
are provided for each email. 

The first 48 attributes are of the form ``word_freq_<WORD>``
for some word, <WORD>. These attributes contain the percentage of words (as a float between 
0 and 1) in the email that match <WORD>. 

Six attributes are of the form ``char_freq_<CHAR>`` for some character, <CHAR>. These 
attributes contain the the percentage of characters (as a float between 
0 and 1) in the email that match <CHAR>. 

The last three attributes are defined as follows:

* ``capital_run_length_average``: Average length of uninterrupted sequences of capital 
  letters.
* ``capital_run_length_longest``: Length of longest uninterrupted sequence of 
  capital letters. 
* ``capital_run_length_total``: Sum of the lengths of uninterrupted sequences of 
  capital letters. (i.e., total number of capital letters in the email)

Finally, a single attribute ``spam`` with values in :math:`\{ 0, 1 \}` is the target
(dependent) variable. A value of 1 indicates the email was spam while a value of 0
indicates that it was not spam.  

We provide an outline for how one would make use of the material above 
on linear classification as well as previous material on data analysis to build 
a model to classify email as either spam or not spam. 

We'll cover the basic outline in class and leave it as a take-home exercise


Outline and Take-Home Exercise For Building a Spam Email Detector
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Step 1.** *Getting and Preparing the Data*. To get started, we'll download 
the Spambase dataset. The dataset is available for download from the 
UCI Archive `here <https://archive.ics.uci.edu/dataset/94/spambase>`_. The primary 
dataset is described in a csv file called spambase.data. An auxillory file called 
spambase.names describes the dataset. 

1. Unpack and save the data file to your local machine. Read the data into a Pandas 
   dataframe.

2. Check and print the number of rows and columns in the dataset. 


**Step 2.** *Data Exploration*.

1. Compute standard statistics for each of the columns in the dataset, including:
   count, mean, standard deviation, min and max. 

2. Determine if there are any duplicate rows in the data set. If there are any 
   duplicate rows, remove them. 

3. Determine if there are any null/missing values in the data set. 

4. Determine how many rows are spam and how many are not spam.  


**Step 3.** *Split and Fit*. 

1. Split the data into training and test datasets. Make sure your split is 
   reproducible and that it maintains roughly the proportion of spam and non-spam emails.

2. Fit the data to a linear classifier using the Perceptron algorithm. 

**Step 4.** *Validation and Assessment*. 

1. Check the accuracy of your model on the test data set. 

2. Check the accuracy of your model on the training data set. 

3. Plot a confusion matrix for your model. How does the model perform with respect to different labels in 
   the target class? Do you think one is more important to minimize? 
   

References and Additional Resources
-----------------------------------
1. UC Berkeley CS189/289A: Introduction to Machine Learning. https://people.eecs.berkeley.edu/~jrs/papers/machlearn.pdf
2. Professor’s perceptron paved the way for AI – 60 years too soon. https://news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon
3. R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. hdl:2440/15227
4. sklearn train_test_split documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
5. Hopkins,Mark, Reeber,Erik, Forman,George, and Suermondt,Jaap. (1999). Spambase. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X.