Exam Study Guide 
================

The exam will cover Unit 1, Unit 2, and Unit 3 up to and including the basics of 
Artificial Neural Networks (ANNs) and dense (fully connected) networks. It will 
not include CNNs or any material covered after them. 

There are two sections to the guide: a general "exam preparation" section, and a 
"sample exam questions" section. The preparation section is intended to be comprehensive 
and give you a guide for the kinds of topics you should understand. The example 
questions section is intended to give you a flavor of the kinds of questions 
you are likely to see on the exam. Note that the examples do not constitute 
a comprehensive set of questions. 


General Exam Preparation 
------------------------

Unit 1
^^^^^^
* Explain the difference between a Pandas Series and a DataFrame. To load data 
  from a CSV file, which would you use? What does the Pandas DataFrame correspond 
  to with respect to the CSV file? What does the Series correspond to? 
* Describe some of the sources for outliers in data. In which cases should outliers 
  be discarded from the dataset, and in which cases should they be included? 
* When do you need to use imputation on a dataset? What is imputation and 
  what is the difference between univariate and multivariate imputation?
* Describe some examples of univariate imputation and some examples of multivariate 
  imputation. What ``numpy`` functions might be used in an implementation? 
* What does the ``groupby`` function do in Pandas? How would you use ``groupby`` 
  to implement a univariate imputation? 
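
As a minimal sketch of the Pandas ideas above, the following fills missing values 
first with a single overall median and then with a per-group median computed via 
``groupby``. The ``cars`` data, the column names, and the choice of the median are 
assumptions made for illustration only; they are not taken from the course material. 

.. code-block:: python3 

  import numpy as np
  import pandas as pd

  # Hypothetical toy data with one missing price per brand.
  cars = pd.DataFrame({
      "brand": ["Toyota", "Toyota", "Toyota", "Tesla", "Tesla", "Tesla"],
      "price": [20000.0, np.nan, 24000.0, 80000.0, 90000.0, np.nan],
  })

  # Selecting a single column of a DataFrame yields a Series.
  prices = cars["price"]

  # Fill every gap with the median of all observed prices.
  overall = prices.fillna(prices.median())

  # groupby variant: fill each gap with the median price of its own brand.
  per_brand = cars.groupby("brand")["price"].transform(
      lambda s: s.fillna(s.median())
  )

  print(overall.tolist())
  print(per_brand.tolist())

Comparing the two printed lists shows how grouping changes the value that is filled in. 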

Unit 2
^^^^^^
* Describe the difference between supervised and unsupervised learning. Under which 
  scenarios would you use one versus the other? What are the main advantages and 
  disadvantages of each? Given specific scenarios and datasets, make sure you can 
  identify which should be used.
* What is the difference between classification and regression? Given specific 
  scenarios and datasets, make sure you can identify which should be used.
* How would you identify a linearly separable dataset from a pictorial example? 
  If a dataset is linearly separable, what guarantees could you make? 
* How is a decision function used in a machine learning model? Is it used in classification, 
  regression or both? 
* What is the difference between accuracy, recall, precision, and F-1? Make sure you can 
  identify which metric is most important for a given scenario and be able to justify 
  your answer. 
* Describe the two primary methods we have discussed in class to optimize different 
  classification metrics. Be sure you are able to answer questions about how to 
* What is the meaning of the “k” in the K-nearest neighbor algorithm? What is a 
  hyperparameter in a machine learning model? What standard method do we use to 
  determine the optimal values of hyperparameters?
* Describe the advantages and disadvantages of K-nearest neighbor versus Linear 
  Classification.
* As the value of “k” in the K-nearest neighbor algorithm increases, how does the model’s 
  sensitivity to outliers change? 
* What is the definition of a hyperparameter? Describe some examples of hyperparameters in the 
  models we have studied, and know the ways in which their values can impact the model's 
  performance. 
* Describe cross validation and what it is used for. 
* Describe advantages and disadvantages of the Decision Tree algorithm. How does it 
  compare with Linear Classification? KNN? 
* What is the relationship between Decision Trees and Random Forests? What is the 
  advantage of Random Forests compared to Decision Trees? What is the disadvantage? 
* To what extent is Random Forest an example of an ensemble method? What are other examples 
  of ensemble methods? 
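
As a rough illustration of cross validation used in a hyperparameter search, the 
sketch below tunes "k" for K-nearest neighbor with scikit-learn's ``GridSearchCV``. 
The synthetic dataset from ``make_classification``, the range of k values, the 
5 folds, and the recall metric are all assumptions chosen for the example, not 
values prescribed by the course. 

.. code-block:: python3 

  from sklearn.datasets import make_classification
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  # Synthetic binary classification data (an assumption for illustration).
  X, y = make_classification(n_samples=300, n_features=5, random_state=1)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=1
  )

  # Candidate values of the hyperparameter k.
  param_grid = {"n_neighbors": list(range(1, 16))}

  # Each candidate is scored with 5-fold cross validation on the training data only.
  search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="recall")
  search.fit(X_train, y_train)

  best_k = search.best_params_["n_neighbors"]

  # The held-out test set is used once, after the search has finished.
  test_recall = search.score(X_test, y_test)
  print(best_k, test_recall)

In this sketch the test split plays no part in choosing k; it is only used for the 
final evaluation. 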

Unit 3 
^^^^^^
* What is the mathematical definition of a perceptron? How do the weights, biases, 
  and activation function factor into the definition?
* What is the role of the activation function in an ANN? 
* What are the advantages of ANNs over the methods we have discussed in Unit 2? 
  What are some of the disadvantages? 
* For a fully connected ANN, how does the input dimension of a layer depend on the 
  output dimensions of other layers? 
* How does the input dataset put constraints on the architecture of the ANN? 
  Which layer(s) are constrained? And which parts of the layer(s)? 
* For a classification problem, what are the constraints on the dimensions of an ANN?
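
The following NumPy sketch shows one common formulation of a perceptron (a weighted 
sum of the inputs plus a bias, passed through an activation function) and how the 
shapes of two dense layers chain together. The step activation, the helper names 
``perceptron`` and ``dense``, and the layer sizes are assumptions for illustration; 
the definition used in class may differ in its details. 

.. code-block:: python3 

  import numpy as np

  def step(z):
      """Step activation: 1 where z >= 0, otherwise 0."""
      return np.where(z >= 0, 1, 0)

  def perceptron(x, w, b, activation=step):
      """Weighted sum of the inputs plus a bias, passed through an activation."""
      return activation(np.dot(w, x) + b)

  x = np.array([1.0, 2.0])
  w = np.array([0.5, -1.0])
  low = perceptron(x, w, b=1.0)   # weighted sum is -0.5
  high = perceptron(x, w, b=2.0)  # weighted sum is 0.5

  def dense(x, W, b, activation):
      """A dense layer: one row of W (and one entry of b) per perceptron."""
      return activation(W @ x + b)

  rng = np.random.default_rng(0)
  W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)  # 2 inputs -> 4 perceptrons
  W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)  # 4 inputs -> 3 perceptrons

  hidden = dense(x, W1, b1, np.tanh)
  output = dense(hidden, W2, b2, np.tanh)
  print(int(low), int(high), hidden.shape, output.shape)

Note that the second weight matrix must have as many columns as the first layer has 
perceptrons; otherwise the matrix product is not defined. 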

  
Example Exam Questions 
----------------------

.. warning:: 

  This set of example questions is **not** intended to be a comprehensive study guide. Rather,
  it is only intended to give you a sense of the format of questions you will be asked. 
  Be sure to review all of the topics in the previous section. 


Short Answer 
^^^^^^^^^^^^

1. For each of the following scenarios, specify whether the problem would best be solved 
   as a supervised or unsupervised learning problem.

   * An e-commerce company wants to build a model to segment its customers into groups with 
     similar purchasing patterns, without any predefined categories.
   * A hospital has a dataset of patient blood test results, and each record indicates whether or not 
     the patient was later diagnosed with diabetes. They want to train a model to predict 
     if a new patient has diabetes based on their blood test results.
   * A speech recognition team wants to build a model that converts spoken audio clips 
     into text, using a dataset of audio clips paired with their corresponding 
     transcriptions.
   * An online streaming music company has the listening history of each of its users. It 
     would like to build a model to identify groups of users with similar listening habits. 

2. True/False

   * A Linear Regression model makes use of decision functions and the perceptron algorithm. 
   * A dataset contains information about the lengths of flower petals, but some of the values 
     are missing. Replacing the missing flower petal lengths with the median of all lengths is 
     an example of univariate imputation. 
   * Decision Trees are an example of an ensemble method in machine learning. 
   * A medical lab is training a machine learning model to predict whether a patient will be 
     eligible for a new treatment. The treatment is cheap and very safe. The lab should 
     evaluate and select the best model using the recall metric to minimize false negatives.

3. Multiple choice 

   * The K-nearest Neighbor algorithm:

     a) Is one of the most accurate machine learning models
     b) Can learn non-linear decision boundaries
     c) Does not have any hyperparameters, so it is fast to train
     d) Works best with recall
     e) None of the above 

   * The F-1 metric minimizes: 

     a) False positives 
     b) False negatives 
     c) Precision 
     d) Recall 
     e) None of the above  

   * When defining a ``Dense`` layer object in a ``Sequential`` Artificial 
     Neural Network (ANN) using the Keras API, one must always:

     a) Pass the input dimension as an argument
     b) Ensure the layer has more perceptrons than its input dimension
     c) Specify the batch size to use
     d) Specify the number of perceptrons in the layer
     e) None of the above 

Longer Answer 
^^^^^^^^^^^^^

1. What is the purpose of cross validation in machine learning?  

2. In the context of machine learning explain the difference between 
   training accuracy, validation accuracy and test accuracy.

3. Explain the difference between the Decision Tree algorithm and the Random Forest 
   algorithm. What are the strengths and weaknesses of each? 

4. When would you use ``RobustScaler`` and when would you use ``StandardScaler`` on 
   a data set? 
   

Code Analysis 
^^^^^^^^^^^^^

1. What will be the output of the following code? 

.. code-block:: python3 

  import pandas as pd

  df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30]
  })

  df["A"] = df["A"].map(lambda x: x * 2)
  print(df)

2. What is the output of the following code?

.. code-block:: python3 

  import pandas as pd

  cars = pd.DataFrame({
    "brand": ["Toyota", "Toyota", "Tesla", "Tesla"],
    "price": [20000, 25000, 80000, 90000]
  })
  print(cars.groupby("brand")["price"].apply(sum))

3. Your friend tells you they just figured out a clever way to 
   improve the accuracy on their class project that uses K-nearest
   Neighbor. They show you their main loop:

.. code-block:: python3 

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
      
      best_model = None 
      best_k = 0
      best_accuracy = 0

      for k in np.arange(1, 100):
        m = KNeighborsClassifier(n_neighbors=k)
        m.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, m.predict(X_test))
        if accuracy > best_accuracy:
          best_model = m 
          best_k = k 
          best_accuracy = accuracy 
      print(f"The best model is: {best_model}")  
  
The syntax is correct and the code produces a best model, but what is the 
flaw in your friend's approach? 

Code Authoring 
^^^^^^^^^^^^^^

1. You want to develop an Artificial Neural Network to classify images as containing 
   cats, dogs or neither. You have a set of labeled images that are grayscale (i.e., one 
   channel of intensity with values between 0 and 255) of size 1000x728 pixels.

   Your network should be a dense (i.e., fully connected) network with three layers: 
   one input layer, one hidden layer, and one output layer. Write the code to construct 
   such an ANN using the Keras API. In this section, you do not need to train your model, 
   only define the architecture. You can use the following code snippets, but note, you 
   may not need or want to use all of them. 

   .. code-block:: python3

    Dense(?, input_shape=(?,), activation=?)
    
    from tensorflow.keras import Sequential

    m = Sequential()

    image_size = 1000*728

    image_dimension = 1000*728*255

    from tensorflow.keras.layers import Dense

    "tanh", "relu", "softmax", optimizer="adam", loss='categorical_crossentropy'
    
    from tensorflow.keras.utils import to_categorical

    m.add(?)

2. You want to build a model to detect whether AI was used on an exam, which is not 
   allowed and constitutes academic dishonesty (cheating).

   Write code to perform a hyperparameter search for the optimal Random Forest classifier 
   model. Your search should explore a space that includes random forests containing 
   anywhere from 2 to 100 trees, that have a maximum depth between 2 and 10 levels, and 
   that require a minimum number of samples per leaf between 2 and 5.   

   What is the best metric to use for this use case? Explain your answer and optimize 
   your grid search for this metric. 

   You may want to use some of (but not all of) the following code snippets in your solution: 

   .. code-block:: python3

    gscv = GridSearchCV(model, param_grid, cv=?, n_jobs=4, scoring=?)

    from sklearn.tree import DecisionTreeClassifier

    param_grid = { . . . } 
    
    "min_samples_leaf": np.arange(start=?, stop=?)

    gscv.fit(X_train, y_train)
    
    from sklearn.ensemble import RandomForestClassifier

    model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    
    model = RandomForestClassifier(random_state=?)
    
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
    
    "n_estimators": np.arange(start=?, stop=?, step=?)
    
    "max_depth": np.arange(start=2, stop=20),