Hands-on Neural Networks 
============================

In this section we will build a simple neural network, train it, and validate it on a sample test dataset. For this exercise we will use a popular dataset available in Keras,
known as the **MNIST** (Modified National Institute of Standards and Technology) dataset. This dataset is a collection of 70,000 images of handwritten digits from 0 to 9, each 28x28 pixels in size, and our goal is to accurately identify the digits by creating a neural network.

By the end of this module students should be able to:

1. Import the Keras MNIST dataset.
2. Pre-process images so they are suitable to be fed to the neural network.
3. Apply data preprocessing for converting output labels to one-hot encoded variables.
4. Build a sequential model neural network.
5. Evaluate the model's performance on test data.
6. Add more layers to the neural network and evaluate if the model's performance improved or degraded, 
   leveraging the same test data.


Step 1: Importing required libraries and data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We know that the **mnist** dataset is available in ``keras``, so we will first import keras and then import the dataset.
MNIST has a training set of 60,000 grayscale images of handwritten digits 0-9, each 28x28 pixels, along with a test set of 10,000 grayscale images of the same size.

.. code-block:: python3

    import keras
    from keras.datasets import mnist

We will load the training and test data directly, as below:

.. code-block:: python3

    (X_train, y_train), (X_test, y_test) = mnist.load_data()

This returns two tuples of NumPy arrays: ``(X_train, y_train)`` and ``(X_test, y_test)``.
We can inspect their shapes:

.. code-block:: python3

    # Shape of training data. X_train contains train images and y_train contains output labels for train images
    print(X_train.shape)
    print(y_train.shape)

    # Shape of test data. X_test contains test images and y_test contains output labels for test images 
    print(X_test.shape)
    print(y_test.shape)

Observe that:

* ``X_train`` is a 3D array of shape ``(60000, 28, 28)``, i.e., 60,000 images of size 28 x 28 pixels.
* ``y_train`` is a 1D array with 60,000 labels.
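
The layout of these arrays can be explored without downloading anything. The following is a minimal sketch that assumes small stand-in arrays (``fake_images`` and ``fake_labels`` are hypothetical names, not part of the dataset) with the same axis layout as ``X_train`` and ``y_train``:

```python
import numpy as np

# Stand-in for X_train: 6 blank "images" of 28 x 28 pixels (hypothetical data)
fake_images = np.zeros((6, 28, 28), dtype=np.uint8)
# Stand-in for y_train: one integer label per image (hypothetical data)
fake_labels = np.array([5, 0, 4, 1, 9, 2], dtype=np.uint8)

print(fake_images.ndim)      # number of axes: (image index, row, column)
print(fake_images.shape)     # (number of images, height, width)
print(fake_images[0].shape)  # indexing the first axis selects one 2D image
print(fake_labels.shape)     # one label per image
```

Indexing along the first axis is what ``X_train[i]`` does in the plotting code later in this section.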

Step 2: Image Pre-processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A ``grayscale`` image is an image where each pixel is represented by a single scalar value 
indicating the brightness or intensity of that pixel.

Color images have three channels (e.g., red, green, and blue) whereas grayscale images have only one channel.
In a grayscale image, the intensity value of each pixel typically ranges from 0 to 255, where ``0`` 
represents black (no intensity) and ``255`` represents white (maximum intensity). 

Grayscale images are commonly used in various image processing and computer vision tasks, including 
image analysis, feature extraction, and machine learning. 
They are simpler to work with as compared to color images, as they have only one channel, 
making them computationally less expensive to process. 
Additionally, for certain applications where color information is not necessary, grayscale images 
can provide sufficient information for analysis.

Let's look at a few sample images from the dataset along with their labels.

.. code-block:: python3

    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 1))
    for i in range(5):
        # Set the (i+1)st subplot in a plot with 5 images in 1 row. 
        plt.subplot(1, 5, i+1)
        plt.imshow(X_train[i], cmap="gray")
    print('Labels for the images above: %s' % (y_train[0:5]))

The first parameter of ``subplot`` represents the number of rows, the second represents the number of 
columns and the third represents the subplot index. Subplot indices start from 1, so ``i+1`` ensures 
that the subplot position starts from 1 and increases by 1 in each iteration.

.. figure:: ./images/digits.png
    :width: 700px
    :align: center
    :alt: 

Each image has a total of :math:`28 \times 28 = 784` pixels, with intensities between 0 and 255. Each of these pixel values 
is treated as an independent feature of the image, so the total number of input dimensions/features of each 
image is 784. But each image provided to us is a 2D array of size 28x28. We will have to reshape/flatten it
to generate a 1D vector of size 784 so it can be fed to the very first dense layer of the neural network.
We will use the ``reshape`` method to transform the array to the desired dimensions.

.. code-block:: python3

    # Flatten the images
    image_vector_size = 28*28
    X_train = X_train.reshape(X_train.shape[0], image_vector_size)
    X_test = X_test.reshape(X_test.shape[0], image_vector_size)

``reshape`` is a NumPy array method that changes the shape of the given array without changing the
data. By reshaping ``X_train`` to the shape ``(X_train.shape[0], image_vector_size)``, 
each image in the training dataset is flattened into a one-dimensional array of size ``image_vector_size``.
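
To see what flattening does to the data, here is a small sketch on a toy array. It assumes two 2 x 3 "images" standing in for the 28 x 28 MNIST images; the variable names are illustrative only.

```python
import numpy as np

# Two tiny 2 x 3 "images" standing in for the 28 x 28 MNIST images
images = np.arange(12).reshape(2, 2, 3)

# Flatten each image into a vector of 2 * 3 = 6 values
flat = images.reshape(images.shape[0], 2 * 3)

print(flat.shape)  # (2, 6)
print(flat[0])     # rows of the first image laid end to end: [0 1 2 3 4 5]

# Passing -1 asks NumPy to infer the remaining dimension
flat_inferred = images.reshape(images.shape[0], -1)
print(np.array_equal(flat, flat_inferred))  # True
```

The pixel values themselves are untouched; only the way they are indexed changes.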

Next, we normalize the image pixels, which is a common preprocessing step in machine learning tasks, 
particularly in computer vision, where it helps improve the convergence of models during training. 
Normalization typically involves scaling the pixel values to be within a specific range, such as [0, 1]. 

You can either use the Keras preprocessing API to rescale or simply divide the pixel values by 255.
For this example, we adopt the latter approach:

.. code-block:: python3

    X_train_normalized = X_train / 255.0    
    X_test_normalized = X_test / 255.0
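
The effect of the division can be checked on a handful of values. This is a minimal sketch assuming a tiny ``uint8`` array (``pixels``, a hypothetical stand-in for real image data):

```python
import numpy as np

# Three sample intensities: black, a dark gray, and white
pixels = np.array([[0, 51, 255]], dtype=np.uint8)

# Dividing by a float converts the integers to floating point values in [0, 1]
scaled = pixels / 255.0

print(scaled)                      # [[0.  0.2 1. ]]
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Note that the result is a new floating point array; the original integer array is left unchanged.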

Step 3: Data pre-processing on output column.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We see that the dependent or target variable (``y_train``) that we want to predict is a 
categorical variable and holds labels 0 to 9. We have previously seen that we can one-hot encode
categorical variables. Here we use the ``to_categorical`` utility function from ``keras.utils`` 
to convert the labels to a one-hot encoding.

.. code-block:: python3

    from tensorflow.keras.utils import to_categorical

    # Convert to "one-hot" vectors using the to_categorical function
    num_classes = 10
    y_train_cat = to_categorical(y_train, num_classes)

Question: Can you guess what ``y_train_cat[0]`` will be? How about ``y_train_cat[1]``?
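
If one-hot encoding feels abstract, it can be reproduced by hand. The sketch below is one way to do so with NumPy on made-up labels; it illustrates the idea and is not meant to describe how ``to_categorical`` is implemented.

```python
import numpy as np

# Made-up integer labels (not taken from MNIST)
labels = np.array([3, 0, 9])
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity matrix by the labels encodes all of them at once
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # a single 1.0 at index 3, zeros elsewhere
```

Each row contains exactly one ``1``, located at the index given by the original label.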

Step 4: Building a Sequential Neural Network 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's now create a neural network. We will create a neural network with one input layer, one 
hidden layer and one output layer and check its prediction accuracy on the test data.

We will need to import Sequential and Dense from Keras.

.. code-block:: python3

    # Import the classes needed to create the neural network
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    image_size=28*28

    # create model
    model = Sequential()  
    # input layer
    model.add(Dense(784, activation='relu',input_shape=(image_size,))) 

    # Hidden layer
    model.add(Dense(128, activation='relu')) 

    # Softmax activation function is selected for multiclass classification
    model.add(Dense(10, activation='softmax')) 

There are a few key points in the above architecture. 
First, we have an input layer with 784 perceptrons, an ``input_shape`` equal to the shape of the flattened 
(i.e., 1-dimensional) image array, and the ``relu`` activation function. It is very 
common to see the input layer specified in this way, with the number of perceptrons equal to the 
input dimension. But note that this is not strictly required: we could have used any number of 
perceptrons; the only requirement is that the ``input_shape`` match the dimension of our inputs (the flattened images, 
in our case). 

In the hidden layer, we specified 128 perceptrons. This is not an uncommon choice, but again, we could
have chosen any number here. The only requirement is that the input dimension of each perceptron 
equal the output dimension of the previous layer. But we are not specifying the input dimension, as 
Keras will determine that automatically for us. 

Note also that in both the input and hidden layer, we are using the ``relu`` activation function. Again, 
this is a common choice, but other options could have been chosen. 
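
To make the shape requirements concrete, the following sketch performs the computation of a single dense layer by hand in NumPy. The weights here are random placeholders rather than trained values, and the names (``relu``, ``W``, ``b``) are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU keeps positive values and replaces negative values with 0
    return np.maximum(z, 0.0)

# A batch of 4 flattened "images" with 784 features each (random placeholder data)
batch = rng.random((4, 784))

# A dense layer with 128 perceptrons needs one weight per (input, perceptron) pair
# and one bias per perceptron
W = rng.normal(scale=0.01, size=(784, 128))
b = np.zeros(128)

# Forward pass: matrix product, add biases, apply the activation
hidden = relu(batch @ W + b)
print(hidden.shape)  # (4, 128)
```

The number of rows in ``W`` must equal the output dimension of the previous layer, which is the requirement Keras resolves automatically for every layer after the first.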

Finally, notice that we use the ``softmax`` activation function in the output layer.
The softmax activation function is commonly used in the output layer of a neural network, 
especially in multiclass classification problems. 
It normalizes the output of a neural network into a probability distribution over multiple classes, 
ensuring that the sum of the probabilities of all classes is equal to 1.
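
As an illustration of this normalization, here is a small NumPy sketch of the softmax function applied to made-up scores. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability; the output is unchanged
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # Divide by the row sum so that each row adds up to 1
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Two rows of made-up raw scores, one row per input
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)

print(probs.sum(axis=1))  # each row sums to 1 (up to rounding)
print(probs[1])           # equal scores give equal probabilities
```

The largest score always receives the largest probability, so the ordering of the classes is preserved.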

Neural Network Architectures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We have a lot of options when designing an ANN. How many layers should we use? How many 
perceptrons should each layer have? In general, these are complicated questions and 
there are no simple "recipes" for determining the optimal values. 

However, there are some general guidelines we can use to approach these questions. These 
include answering the following questions:

1. How complicated and/or sophisticated is the underlying pattern that the model is trying to learn?
2. How many computational resources are available to the project? 
3. How much data is available for training? 
4. How important is the accuracy of the final model? 

This is where engineering design and tradeoffs come into play. More complicated patterns 
typically require larger neural networks to achieve higher accuracy. For example, our 
task above of classifying an image as 1 of 10 digits is a much simpler task than trying to 
classify all characters in the Roman alphabet, and that, in turn, is a much simpler task than 
classifying all characters in Kanji. 
Similarly, trying to classify an image as a cat or a dog is much simpler than trying to classify
any species on the planet. 

Along the same lines, training a larger neural network requires additional computational
resources and more high-quality data than a smaller one. 

Finally, for some problems, accuracy is less critical than others. Imagine a recommendation 
system that predicts music that a listener will enjoy. It may be less critical that this system 
achieve a high accuracy as compared to the OCR system used to load non-digitized data into 
the music catalog. Typically, organizations have a finite set of resources and must be careful 
in how they choose to spend them. 


Model Training
^^^^^^^^^^^^^^

Let's compile and fit the model.

.. code-block:: python3

    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_normalized, y_train_cat, validation_split=0.2, epochs=5, batch_size=128, verbose=2)

Here we use the following parameters to the ``compile`` method: 

* ``optimizer='adam'``: As mentioned previously, this is a good default choice.   
* ``loss='categorical_crossentropy'``: As mentioned previously, this is an appropriate choice for 
  multiclass classification problems with one-hot encoded labels. 
* ``metrics=['accuracy']``: Here, we specify accuracy as the metric to track. 
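
To make the loss and the metric concrete, the sketch below computes categorical cross-entropy and accuracy by hand for two made-up predictions. The helper ``categorical_crossentropy`` defined here is a hand-written illustration, not the Keras function of the same name.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid taking log(0); eps is an arbitrary small constant
    y_prob = np.clip(y_prob, eps, 1.0)
    # Only the probability assigned to the true class contributes to the loss
    return -np.sum(y_true * np.log(y_prob), axis=1)

# One-hot true labels and made-up predicted probabilities for two samples
y_true = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
y_prob = np.array([[0.1, 0.8, 0.1],
                   [0.3, 0.4, 0.3]])

losses = categorical_crossentropy(y_true, y_prob)
accuracy = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))

print(losses)         # per-sample loss: -ln(0.8) and -ln(0.3)
print(losses.mean())  # average loss over the two samples
print(accuracy)       # 0.5, since only the first sample is classified correctly
```

A confident correct prediction yields a small loss, while a low probability on the true class yields a large one.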

We pass these parameters to the ``fit`` method: 

* ``validation_split=0.2``: specifies the fraction of the training data to use for validation. In this 
  case, 20% of the training data will be used for validation during training, and the remaining 80% 
  will be used for actual training.
* ``epochs=5``: The number of epochs (iterations over the entire training dataset) to train the model. 
  In this case, the model will be trained for 5 epochs.
* ``batch_size=128``: The number of samples processed before each update of the model's weights.
* ``verbose=2``: Prints one line of output per epoch.

.. code-block:: text

    Epoch 1/5
    375/375 - 3s - loss: 0.0598 - accuracy: 0.9095 - val_loss: 0.0272 - val_accuracy: 0.9594 - 3s/epoch - 8ms/step
    Epoch 2/5
    375/375 - 2s - loss: 0.0202 - accuracy: 0.9693 - val_loss: 0.0188 - val_accuracy: 0.9708 - 2s/epoch - 5ms/step
    Epoch 3/5
    375/375 - 2s - loss: 0.0129 - accuracy: 0.9816 - val_loss: 0.0150 - val_accuracy: 0.9766 - 2s/epoch - 5ms/step
    Epoch 4/5
    375/375 - 2s - loss: 0.0089 - accuracy: 0.9879 - val_loss: 0.0149 - val_accuracy: 0.9763 - 2s/epoch - 5ms/step
    Epoch 5/5
    375/375 - 2s - loss: 0.0061 - accuracy: 0.9921 - val_loss: 0.0154 - val_accuracy: 0.9776 - 2s/epoch - 5ms/step

Let's break down the output: 

* ``375/375``: Indicates that the training process has completed all 375 batches in the epoch. 
  With ``validation_split=0.2``, 48,000 of the 60,000 training images are used for fitting, and 
  with ``batch_size=128`` that gives :math:`48000 / 128 = 375` batches per epoch.

* ``3s`` or ``2s``: The time taken to complete that epoch, here approximately 2 to 3 seconds.

* ``loss``: The value of the loss function (typically categorical cross-entropy loss for 
  classification tasks) computed on the training dataset. 

* ``accuracy``: The accuracy of the model on the training dataset. The accuracy value of 
  approximately 0.99 in the final epoch indicates that the model correctly predicted about 99% of 
  the training samples.

* ``val_loss``: The value of the loss function computed on the validation dataset. 

* ``val_accuracy``: The accuracy of the model on the validation dataset. The validation 
  accuracy value of approximately 0.98 in the final epoch indicates that the model correctly 
  predicted about 98% of the validation samples.

* ``5ms/step``: The average time taken per training step (one forward and backward pass 
  through a single batch) during training.
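
The ``375/375`` figure can be reproduced from the arguments passed to ``fit``. The arithmetic below is a sketch using only the Python standard library; the rounding step is an assumption made for this illustration and may not match the library's own split logic in edge cases.

```python
import math

num_train_images = 60_000
validation_split = 0.2
batch_size = 128

# Samples held out for validation and samples left for fitting
num_validation = round(num_train_images * validation_split)
num_fit = num_train_images - num_validation

# Number of batches needed to cover the fitting samples once
steps_per_epoch = math.ceil(num_fit / batch_size)

print(num_fit, num_validation, steps_per_epoch)  # 48000 12000 375
```

Changing ``batch_size`` changes the number of steps per epoch, which is one reason the time per epoch varies with this setting.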

We can next print the model summary, which shows how many trainable parameters the model has:

.. code-block:: python3

    model.summary()

.. figure:: ./images/model_summary.png
    :width: 700px
    :align: center
    :alt: 

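Here the total number of parameters and the number of trainable parameters are the same: 717,210.
For each dense layer, the count is the number of weights (inputs to the layer times perceptrons in 
the layer) plus one bias per perceptron. For the first layer this gives 
:math:`784 \times 784 + 784 = 615{,}440`; for the hidden layer, 
:math:`784 \times 128 + 128 = 100{,}480`; and for the output layer, 
:math:`128 \times 10 + 10 = 1{,}290`. Summing these gives 
:math:`615{,}440 + 100{,}480 + 1{,}290 = 717{,}210`.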
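
The same bookkeeping can be written as a few lines of plain Python. This is a sketch for the three dense layers used in this section; ``dense_params`` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

# Input dimension followed by the number of units in each Dense layer
layer_sizes = [784, 784, 128, 10]

# Pair each layer's input size with its number of units
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [615440, 100480, 1290]
print(sum(per_layer))  # 717210
```

Editing ``layer_sizes`` is a quick way to estimate how large a candidate architecture will be before training it.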


**Optional:**
To see the weights and biases at the end of each epoch, we can use the helper function below:

.. code-block:: python3

    from tensorflow.keras.callbacks import LambdaCallback

    # Define a callback function to print weights and biases at the end of each epoch
    def print_weights_and_biases(epoch, logs):
        # Keras passes a zero-based epoch index, so add 1 to match the "Epoch 1/5" output
        print(f"\nWeights and Biases at the end of Epoch {epoch + 1}:")
        for layer in model.layers:
            print(f"Layer: {layer.name}")
            # For a Dense layer, get_weights() returns [weights, biases]
            weights, biases = layer.get_weights()
            print(f"Weights:\n{weights}")
            print(f"Biases:\n{biases}")

    # Create a LambdaCallback to call the print_weights_and_biases function
    print_weights_callback = LambdaCallback(on_epoch_end=print_weights_and_biases)

When we fit the model, we will specify the ``callbacks`` parameter:

.. code-block:: python3

    model.fit(X_train_normalized, y_train_cat, validation_split=0.2, epochs=5, batch_size=128, verbose=2,callbacks=[print_weights_callback])

This will print all the weights and biases at the end of each epoch. 

Once we have fit the model, the next important step is predicting on the test data.


.. warning:: 

    Be careful when using computational resources on the VM. It is easy to build 
    large networks that exhaust all of the resources and/or to write training 
    loops that take a long time (hours or even days) to complete. 

    Plan your development and training work for the projects carefully! 

    
Step 5: Evaluate the Model's Performance on Test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can use the ``model.predict()`` method directly on the entire test dataset. Remember that 
we want to use the normalized data: 

.. code-block:: python3
    
    >>> y_pred = model.predict(X_test_normalized)

We can see the predictions by printing the ``y_pred`` values. For example: 

.. code-block:: python3

    >>> y_pred[0]

    array([7.8945732e-11, 1.6350994e-10, 4.3761141e-09, 2.2113424e-08,
           3.7417313e-17, 1.5567046e-12, 5.6684709e-17, 9.9999994e-01,
           1.9483424e-11, 1.0344545e-08], dtype=float32)

As you can see, the output values are probabilities. How many probability values do we 
expect there to be? And how should we use these to predict the class label? 


Remember the notion of *decision functions* that we have discussed throughout Units 2 and 3. 
Decision functions provide values that determine whether an instance is in a particular class. 
Thus, there is one decision function, and hence, one value, for each class label. 
In this case, since we used ``softmax`` as the output activation function, each value corresponds 
to the probability that the instance is in that particular class. Therefore, we obtain the predicted class 
from these probabilities by taking the index of the maximum value:

.. code-block:: python3

    import numpy as np
    y_pred_final=[]
    for i in y_pred:
        # return the index with the highest probability 
        y_pred_final.append(np.argmax(i))
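
The same result can be obtained without an explicit loop by passing ``axis=1`` to ``np.argmax``. The sketch below compares the two approaches on a small made-up probability array (``probs`` is a stand-in for ``y_pred``):

```python
import numpy as np

# Made-up probabilities for 3 samples and 3 classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])

# Loop version: one argmax per row
loop_labels = [int(np.argmax(row)) for row in probs]

# Vectorized version: argmax along the class axis for all rows at once
vector_labels = np.argmax(probs, axis=1)

print(loop_labels)    # [1, 0, 2]
print(vector_labels)  # [1 0 2]
```

The vectorized form returns a NumPy array rather than a list, and it tends to be faster on large prediction arrays.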


Visualizing Accuracy with the Confusion Matrix
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With a confusion matrix we can see how many correct vs incorrect predictions were made using
the model above.

.. code-block:: python3

    from sklearn.metrics import confusion_matrix
    import seaborn as sns

    cm=confusion_matrix(y_test,y_pred_final)

    plt.figure(figsize=(10,7))
    sns.heatmap(cm,annot=True,fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Truth')
    plt.show()

The output of the above confusion matrix code is as follows:

.. figure:: ./images/cm_digits.png
    :width: 700px
    :align: center
    :alt: 

The highlighted numbers along the diagonal are the counts of correct predictions, while the 
numbers in the black squares are the counts of incorrect predictions.

Let's also print the accuracy of this model using the code below:

.. code-block:: python3

    from sklearn.metrics import classification_report
    print(classification_report(y_test,y_pred_final))

As you can see, the accuracy of the above model is 98%: the model predicted the correct label 
for 98% of the images in the test data.
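
The accuracy figure is directly related to the confusion matrix. As a sketch, the code below derives accuracy, per-class recall, and per-class precision from a small made-up 3-class matrix, assuming rows are true labels and columns are predicted labels (the axis layout used in the heatmap above).

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes
cm = np.array([[8, 1, 1],
               [0, 9, 1],
               [2, 0, 8]])

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Recall: correct predictions divided by the number of true samples per class
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 4))  # 0.8333
print(recall)              # [0.8 0.9 0.8]
print(precision)           # [0.8 0.9 0.8]
```

These are the same kinds of quantities that ``classification_report`` summarizes for each digit.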

..
    Let's now see if we can improve the model's training by adding more layers in the neural network.

    ``Can we improve this model by increasing the training parameters? Let's find out.``

    Step 6: Adding one or more hidden layers to the above neural network
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    .. code-block:: python3

        from tensorflow.keras import Sequential
        from tensorflow.keras.layers import Dense

        image_size=28*28

        # create model
        model2 = Sequential()  

        model2.add(Dense(256, activation='relu',input_shape=(image_size,))) ###Multiple Dense units with Relu activation
        model2.add(Dense(64, activation='relu'))
        model2.add(Dense(64, activation='relu'))
        model2.add(Dense(32, activation='relu'))

        model2.add(Dense(num_classes, activation='softmax'))
        model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        model2.fit(X_train, y_train_cat, validation_split=0.2, epochs=5, batch_size=128, verbose=2,callbacks=None)
        model2.summary()


    Total params: 223978 (874.91 KB)
    Trainable params: 223978 (874.91 KB)
    Non-trainable params: 0 (0.00 Byte)

    ``From the model summary can you tell how many trainable parameters are present at each layer?``

    Let's look at our model predictions.

    .. code-block:: python3
    
        import numpy as np
        # predicting the model on test data
        y_pred=model2.predict(X_test)

        # As our outputs are probabilities so we will try to get the output class from these probablities by getting the maximum value
        y_pred_final=[]
        for i in y_pred:
            y_pred_final.append(np.argmax(i))


    Next with the help of confusion matrix we can see how many correct vs incorrect predictions were made using the model above.

    .. code-block:: python3

        from sklearn.metrics import confusion_matrix
        import seaborn as sns

        cm=confusion_matrix(y_test,y_pred_final)

        plt.figure(figsize=(10,7))
        sns.heatmap(cm,annot=True,fmt='d')
        plt.xlabel('Predicted')
        plt.ylabel('Truth')
        plt.show()


    .. code-block:: python3

        from sklearn.metrics import classification_report
        print(classification_report(y_test,y_pred_final))

    ``output``
        accuracy                           0.95     10000

    We certainly see an improvement in prediction accuracy. From the confusion matrix we can 
    conclude that the new model has improved on recognizing many digits.

    This concludes all the steps for building a 95% accurate neural network for identifying hand-written digits
    between 0-9.

**In-Class/Take Home Exercise.**
Let's now repeat the hands-on part for the Fashion MNIST dataset. The Fashion MNIST dataset has 10 categories 
of apparel and accessories. Our goal is to accurately classify the images in the test dataset by creating an ANN model.

.. code-block:: python3

        #0 T-shirt/top
        #1 Trouser
        #2 Pullover
        #3 Dress
        #4 Coat
        #5 Sandal
        #6 Shirt
        #7 Sneaker
        #8 Bag
        #9 Ankle boot

Note that in Step 1, loading the data, the source of dataset will change to:

.. code-block:: python3

    # Loading the data
    from tensorflow.keras.datasets import fashion_mnist
    (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

From Step 1, you may check the shape of ``X_train``, ``y_train``. Run through Steps 2 to 5. 

Questions: 

* How confident are you about the model? 
* Does the validation accuracy improve if you train for more epochs?
* Experiment with different network architectures. How does the performance of the model change 
  if you use a different number of perceptrons in the hidden layer? 
  Does adding more hidden layers help?