Cheatsheet
Activation Functions
One of the most important properties of activation functions is their non-linearity, which lets a model learn complex relationships in the data that a purely linear model cannot represent [1].
ReLU (Rectified Linear Unit):
When to use: Default for hidden layers in most neural nets. ReLU is the most commonly used activation function in deep learning today: it is simple, cheap to compute, and mitigates vanishing gradients because its gradient is 1 for every positive input.
Dense(128, activation='relu')
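To make the "fast and gradient-friendly" claim concrete, here is a minimal pure-Python sketch of what ReLU computes and what its derivative looks like. The helper names relu and relu_grad are illustrative only and are not part of Keras.

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp everything else to zero."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU (using 0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(grads)    # [0.0, 0.0, 0.0, 1.0, 1.0]
```

The gradient stays at 1 for positive inputs no matter how large they get, which is why stacked ReLU layers do not shrink the signal the way saturating activations can. The zeros on the negative side are the source of the "dying ReLU" issue.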
Leaky ReLU / ELU
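When to use: When ReLU units get stuck outputting zero. Leaky ReLU addresses the dying ReLU problem [2] by using a small negative slope instead of zero for negative inputs; ELU replaces the negative side with a smooth exponential curve.
LeakyReLU(alpha=0.1)  # add LeakyReLU as a separate layer; Keras 3 renames alpha to negative_slope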
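As a rough illustration of how the two variants treat negative inputs, the sketch below writes both formulas in plain Python. The function names and the default alpha values are assumptions chosen for the example, not the Keras implementations.

```python
import math


def leaky_relu(x: float, alpha: float = 0.1) -> float:
    """Leaky ReLU: identity for positive inputs, small linear slope otherwise."""
    return x if x > 0 else alpha * x


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for positive inputs, smooth exponential curve otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


print(leaky_relu(3.0))       # 3.0
print(leaky_relu(-2.0))      # -0.2
print(round(elu(-2.0), 4))   # -0.8647
```

Both functions return a non-zero value (and therefore a non-zero gradient) for negative inputs, so a unit can still recover after being pushed into the negative region.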
Sigmoid
When to use: For binary classification outputs. Sigmoid squashes any real value into the range 0 to 1, so the output can be read as a probability.
Dense(1, activation='sigmoid')
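A small sketch of the squashing itself, assuming a hypothetical raw score (logit) coming out of the final layer. The function is written in a numerically careful way so large negative inputs do not overflow; it is an illustration, not the Keras code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


logit = 0.8  # hypothetical raw score before the activation
probability = sigmoid(logit)
predicted_class = int(probability >= 0.5)  # threshold at 0.5

print(round(probability, 3))  # 0.69
print(predicted_class)        # 1
```

Thresholding the probability at 0.5 is the usual way to turn the output into a class label, though the threshold can be moved when false positives and false negatives have different costs.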
Tanh (Hyperbolic Tangent)
When to use: When you want sigmoid-like squashing but with outputs between -1 and 1. Because the output is centered on zero, it can help convergence in some networks.
Dense(1, activation='tanh')
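To see the difference in output range, the sketch below evaluates tanh and sigmoid on the same inputs using only the standard library. The sigmoid helper is defined here purely for comparison.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, used here only as a point of comparison."""
    return 1.0 / (1.0 + math.exp(-x))


for x in [-2.0, 0.0, 2.0]:
    print(x, round(math.tanh(x), 3), round(sigmoid(x), 3))
# -2.0 -0.964 0.119
# 0.0 0.0 0.5
# 2.0 0.964 0.881
```

Tanh maps zero to zero and is symmetric around it, while sigmoid maps zero to 0.5. The two are related by tanh(x) = 2 * sigmoid(2x) - 1, so tanh can be seen as a rescaled, recentered sigmoid.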
Softmax
When to use: For the output layer in multi-class classification; it turns the raw scores into class probabilities that sum to 1.
Dense(10, activation='softmax')
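The following is a minimal sketch of the softmax computation, assuming three hypothetical raw scores (logits). Subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(logits)
predicted = probs.index(max(probs))

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(predicted)                     # 0
```

The predicted class is simply the index with the highest probability; the ordering of the raw scores is preserved.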
Linear
When to use: For regression tasks where the output can take any real value.
Dense(1, activation='linear')
Optimizers
Optimizers are algorithms or methods used to adjust the weights and biases of a neural network to minimize the loss function during training. By iteratively updating these parameters, optimizers ensure that the model learns effectively from the data, improving its predictions [3].
SGD (Stochastic Gradient Descent)
When to use: A foundational optimizer and a solid baseline for large datasets. It updates the weights from a single example or a small mini-batch at a time, which keeps each step cheap but makes the loss prone to oscillation; with a well-chosen learning rate it still converges steadily.
optimizer='sgd'
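As an illustration of the update rule, here is a toy mini-batch SGD loop that fits a single weight. The data, learning rate, and batch size are made up for the example, and the loop is a sketch of the idea rather than what Keras does internally.

```python
import random

random.seed(0)

# Hypothetical toy data following y = 2 * x exactly.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w = 0.0               # single weight to learn
learning_rate = 0.05  # assumed value for this toy problem
batch_size = 2

for epoch in range(50):
    random.shuffle(data)  # new random batch order every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the loss 0.5 * (w*x - y)^2, averaged over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad  # step against the gradient

print(round(w, 3))  # close to the true slope of 2.0
```

Each update only looks at two examples, so individual steps are noisy, but they are cheap and on average point toward the minimum.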
SGD + Momentum
When to use: When plain SGD is too slow or oscillates; momentum accumulates past gradients to smooth and speed up the updates.
SGD(learning_rate=0.01, momentum=0.9)
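To show what the "memory of past gradients" does, the sketch below runs plain gradient descent and the momentum variant side by side on a toy loss f(w) = (w - 3)^2. The loss and the step count are assumptions for the example; the velocity update follows the common textbook form.

```python
def grad(w: float) -> float:
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


learning_rate, momentum, steps = 0.01, 0.9, 100

# Plain gradient descent.
w_plain = 0.0
for _ in range(steps):
    w_plain -= learning_rate * grad(w_plain)

# Gradient descent with momentum: velocity carries over past gradients.
w_momentum, velocity = 0.0, 0.0
for _ in range(steps):
    velocity = momentum * velocity - learning_rate * grad(w_momentum)
    w_momentum += velocity

# After the same number of steps, momentum ends up closer to the minimum at 3.
print(abs(w_momentum - 3.0) < abs(w_plain - 3.0))  # True
```

With the same learning rate, the accumulated velocity lets the momentum version cover much more ground in the early steps, which is the speed-up the cheatsheet entry refers to.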
Adam
When to use: Default choice for most deep learning models; it adapts the learning rate per parameter and usually converges quickly.
optimizer='adam'
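For readers who want to see what "adaptive learning rates" means in practice, here is a sketch of a single-parameter Adam update in its textbook form. The function name adam_step and the toy loss are illustrative; the default hyperparameters mirror commonly used values, and library implementations may differ in small details.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single parameter w with gradient g at step t >= 1."""
    m = beta1 * m + (1.0 - beta1) * g      # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Minimize the toy loss f(w) = (w - 3)^2 starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 301):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)

print(round(w, 2))  # ends up close to the minimum at 3
```

Because the step is divided by the root of the squared-gradient average, the very first update has a size of roughly lr regardless of how large the gradient is. That normalization is what makes Adam relatively insensitive to gradient scale.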
RMSprop
When to use: Works well for RNNs or non-stationary problems (changing gradients).
optimizer='rmsprop'
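The sketch below illustrates why RMSprop copes with gradients whose scale changes: each step is divided by a running root-mean-square of recent gradients. The function name rmsprop_step is hypothetical and the formula is the textbook form; libraries differ in details such as where epsilon is added.

```python
import math


def rmsprop_step(w, g, v, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update for a single parameter w with gradient g."""
    v = rho * v + (1.0 - rho) * g * g      # running mean of squared gradients
    w = w - lr * g / (math.sqrt(v) + eps)  # scale the step by the gradient RMS
    return w, v


# Two hypothetical parameters whose gradients differ by four orders of magnitude.
gradients = [100.0, 0.01]
step_sizes = []
for g in gradients:
    new_w, _ = rmsprop_step(0.0, g, 0.0)
    step_sizes.append(abs(new_w))

print([round(s, 5) for s in step_sizes])  # [0.00316, 0.00316]
```

Both parameters move by nearly the same amount even though their raw gradients are wildly different, which is the behavior that helps when gradient magnitudes shift during training.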
References:
[1] Activation Functions: https://towardsdatascience.com/activation-functions-in-neural-networks-how-to-choose-the-right-one-cb20414c04e5/
[2] Dying ReLU problem: https://pythonguides.com/pytorch-leaky-relu/
[3] Optimizers: https://akridata.ai/blog/optimizers-in-deep-learning/