Cheatsheet
======================


Activation Functions
~~~~~~~~~~~~~~~~~~~~~
A key property of activation functions is their non-linearity, which lets a model learn relationships in the data that a purely linear model cannot represent [1].
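
To see why the non-linearity matters, the sketch below stacks two one-input linear layers and checks that they collapse into a single linear layer, while putting a ReLU between them does not. The helper names (``linear``, ``stacked_linear``, ``stacked_relu``) and the weights are made up for illustration.

.. code-block:: python

    def linear(w, b):
        """Return a one-input linear layer x -> w * x + b."""
        return lambda x: w * x + b


    def relu(x):
        return max(0.0, x)


    layer1 = linear(2.0, 1.0)
    layer2 = linear(-3.0, 0.5)

    # Two stacked linear layers: -3 * (2x + 1) + 0.5 = -6x - 2.5,
    # which is just another single linear layer.
    collapsed = linear(-6.0, -2.5)


    def stacked_linear(x):
        return layer2(layer1(x))


    def stacked_relu(x):
        # The same two layers with a ReLU in between.
        return layer2(relu(layer1(x)))


    for x in (-2.0, 0.0, 1.0):
        print(x, stacked_linear(x), collapsed(x), stacked_relu(x))

For negative pre-activations the ReLU version departs from the collapsed linear layer, which is the extra expressive power the activation provides.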

**ReLU (Rectified Linear Unit)**
When to use: The default for hidden layers in most neural networks, and the most commonly used activation function in deep learning today. It is simple and cheap to compute, and because its gradient is 1 for positive inputs it is less prone to vanishing gradients than sigmoid or tanh.  
``Dense(128, activation='relu')``

**Leaky ReLU / ELU**
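When to use: When ReLU units get stuck outputting zero (the dying ReLU problem [2]). Leaky ReLU addresses this by using a small negative slope instead of zero for negative inputs; ELU uses a smooth exponential curve there instead.  
``LeakyReLU(alpha=0.1)  # separate layer; Keras 3 renames alpha to negative_slope``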
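
A minimal sketch of the dying ReLU problem, in plain Python rather than Keras: for a negative input, ReLU has zero output and zero gradient, so the unit receives no learning signal, while Leaky ReLU keeps a small gradient. The function names and the ``alpha=0.1`` default are illustrative choices.

.. code-block:: python

    def relu(x):
        return max(0.0, x)


    def relu_grad(x):
        # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
        return 1.0 if x > 0 else 0.0


    def leaky_relu(x, alpha=0.1):
        return x if x > 0 else alpha * x


    def leaky_relu_grad(x, alpha=0.1):
        # Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise.
        return 1.0 if x > 0 else alpha


    for x in (-2.0, 3.0):
        print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))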

**Sigmoid**
When to use: For binary classification outputs. Sigmoid squashes any real value into the range (0, 1), so the output can be read as a probability.  
``Dense(1, activation='sigmoid')``

**Tanh (Hyperbolic Tangent)**
When to use: Like sigmoid, but when you want outputs between -1 and 1. Its zero-centered output can help convergence in some networks.  
``Dense(1, activation='tanh')``
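
The relationship between the two squashing functions can be checked numerically. This sketch defines ``sigmoid`` by hand (an illustrative helper, not a library call) and compares ``math.tanh`` with the identity tanh(x) = 2 * sigmoid(2x) - 1, which shows tanh is a rescaled, zero-centered sigmoid.

.. code-block:: python

    import math


    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))


    for x in (-4.0, 0.0, 4.0):
        rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
        # sigmoid stays in (0, 1); tanh stays in (-1, 1) and is 0 at x = 0.
        print(x, sigmoid(x), math.tanh(x), rescaled)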

**Softmax**
When to use: For the output layer in multi-class classification, where it turns the raw scores into class probabilities that sum to 1.  
``Dense(10, activation='softmax')``
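
As a rough illustration of what the softmax output layer computes, the sketch below exponentiates each raw score and divides by the total. The ``softmax`` helper and the example scores are assumptions for illustration; subtracting the maximum first is a common trick to avoid overflow and does not change the result.

.. code-block:: python

    import math


    def softmax(logits):
        # Shift by the maximum so math.exp never sees a large positive value.
        shift = max(logits)
        exps = [math.exp(z - shift) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


    probs = softmax([2.0, 1.0, 0.1])
    print(probs, sum(probs))

The largest score gets the largest probability, and the outputs always sum to 1 up to floating-point rounding.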

**Linear**
When to use: For regression tasks where output can take any real value.  
``Dense(1, activation='linear')``

Optimizers
~~~~~~~~~~~
Optimizers are algorithms that adjust the weights and biases of a neural network to minimize the loss function during training. By iteratively updating these parameters, they let the model learn from the data and improve its predictions [3].

**SGD (Stochastic Gradient Descent)**
When to use: SGD is a foundational optimizer that updates the weights from the gradient of a single data point (or, in practice, a small mini-batch) at a time. It is simple and a good fit for large datasets where steady convergence is acceptable, but its noisy updates make the loss prone to oscillation.
``optimizer='sgd'``

**SGD + Momentum**
When to use: When plain SGD is too slow or oscillates. Momentum adds a memory of past gradients that smooths and speeds up the updates.
``SGD(learning_rate=0.01, momentum=0.9)``
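
A minimal sketch of the momentum idea, assuming the classic update ``velocity = momentum * velocity - learning_rate * gradient`` on a toy loss f(w) = w ** 2. The functions ``run_sgd`` and ``run_momentum`` and the starting point are illustrative and not part of any library.

.. code-block:: python

    def grad(w):
        # Gradient of the toy loss f(w) = w ** 2.
        return 2.0 * w


    def run_sgd(w, learning_rate=0.01, steps=50):
        for _ in range(steps):
            w -= learning_rate * grad(w)
        return w


    def run_momentum(w, learning_rate=0.01, momentum=0.9, steps=50):
        velocity = 0.0
        for _ in range(steps):
            # The velocity accumulates past gradients, so consistent
            # directions build up speed.
            velocity = momentum * velocity - learning_rate * grad(w)
            w += velocity
        return w


    w_sgd = run_sgd(5.0)
    w_momentum = run_momentum(5.0)
    print(w_sgd, w_momentum)

With the same learning rate and number of steps, the momentum run ends closer to the minimum at 0 on this toy problem.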

**Adam**
When to use: The default choice for most deep learning models. It adapts the learning rate for each parameter and typically converges fast.
``optimizer='adam'``
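
To illustrate what "adaptive learning rates" means, here is a sketch of a single-parameter Adam update with bias correction. The ``adam_step`` function and its default hyperparameters are assumptions chosen for illustration. The first step has nearly the same size whether the gradient is tiny or huge, because the update is normalized by the gradient's running magnitude.

.. code-block:: python

    import math


    def adam_step(w, g, m, v, t, learning_rate=0.001,
                  beta_1=0.9, beta_2=0.999, epsilon=1e-7):
        # Running averages of the gradient and the squared gradient.
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta_1 ** t)
        v_hat = v / (1 - beta_2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
        return w, m, v


    small_step = -adam_step(0.0, 0.001, 0.0, 0.0, t=1)[0]
    large_step = -adam_step(0.0, 1000.0, 0.0, 0.0, t=1)[0]
    print(small_step, large_step)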

**RMSprop**
When to use: Works well for RNNs or non-stationary problems (changing gradients).  
``optimizer='rmsprop'``
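
A sketch of why RMSprop copes with changing gradient scales, assuming the usual form that divides each gradient by the square root of a running average of squared gradients. The ``rmsprop_step`` helper, its defaults, and the synthetic gradient sequence are illustrative only.

.. code-block:: python

    import math


    def rmsprop_step(g, avg_sq, learning_rate=0.001, rho=0.9, epsilon=1e-7):
        # Exponential moving average of squared gradients.
        avg_sq = rho * avg_sq + (1 - rho) * g * g
        step = learning_rate * g / (math.sqrt(avg_sq) + epsilon)
        return step, avg_sq


    avg_sq = 0.0
    steps = []
    # The gradient scale jumps from 1 to 100 halfway through.
    for g in [1.0] * 100 + [100.0] * 100:
        step, avg_sq = rmsprop_step(g, avg_sq)
        steps.append(step)

    print(steps[99], steps[100], steps[199])

The step grows briefly right after the jump, then settles back near the learning rate as the running average catches up with the new gradient scale.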


References:

1. Activation Functions [https://towardsdatascience.com/activation-functions-in-neural-networks-how-to-choose-the-right-one-cb20414c04e5/]

2. Dying ReLU problem [https://pythonguides.com/pytorch-leaky-relu/]

3. Optimizers [https://akridata.ai/blog/optimizers-in-deep-learning/]