Skip to content

Deep Learning Fundamentals

Activation Functions

tanh (Hyperbolic Tangent)

Definition:

$$\tanh(x) = \frac{e^x - e{-x}}{ex + e^{-x}}$$

Properties:

Property Value
Output range (-1, 1)
Center 0 (zero-centered)
Monotonic Yes, strictly increasing
Derivative 1 − tanh²(x)
Max derivative 1 (at x = 0)
Saturates at ±1 for large |x|

Relationship to sigmoid:

tanh(x) = 2 · sigmoid(2x) − 1

Role in deep learning:

  • Zero-centered output — gradients don't have a systematic positive/negative bias, leading to faster convergence compared to sigmoid
  • LSTM core component — used in cell state updates: the candidate cell state (C̃) passes through tanh to keep values in [-1, 1]
  • Stronger gradients — derivative peaks at 1 (vs 0.25 for sigmoid), so gradients flow better during backpropagation
  • Still suffers vanishing gradients — for |x| > 3, the gradient ≈ 0, which is why LSTM/GRU gates were invented to mitigate this

Where tanh is used in LSTM:

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    ← candidate cell state
h_t  = o_t * tanh(C_t)                      ← output hidden state

When to use tanh vs alternatives:

Use case Recommendation
LSTM/GRU internals tanh (by design)
Hidden layers (general) ReLU or GELU preferred (no saturation)
Output layer, range [-1,1] tanh
Output layer, range [0,1] sigmoid
Classification output softmax / sigmoid

Worked Example: tanh(12039) → 1.0

Step 1 — Plug into the formula:

$$\tanh(12039) = \frac{e^{12039} - e{-12039}}{e$$} + e^{-12039}

Step 2 — Compute the exponentials:

Term Value
e^12039 ≈ 10^5228 (a number with over 5000 digits)
e^-12039 ≈ 10^-5228 (over 5000 zeros after the decimal point)

Step 3 — Simplify:

Numerator   = e^12039 - e^-12039 ≈ 10^5228   (subtracting ~0 changes nothing)
Denominator = e^12039 + e^-12039 ≈ 10^5228   (adding ~0 changes nothing)

tanh(12039) = 10^5228 / 10^5228 = 1.0

Step 4 — The general intuition for large x:

e^-x → 0  when x is large

         e^x - 0     e^x
tanh = --------- = ----- = 1
         e^x + 0     e^x

Step 5 — Convergence table:

x e^x e^-x Numerator Denominator tanh(x)
0 1 1 0 2 0.0000
1 2.718 0.368 2.350 3.086 0.7616
2 7.389 0.135 7.254 7.524 0.9640
3 20.086 0.050 20.036 20.136 0.9951
5 148.41 0.0067 148.40 148.42 0.9999
10 22026 0.0000454 22026 22026 0.99999999..
12039 10^5228 ≈ 0 10^5228 10^5228 1.0

Why the computer says exactly 1.0

Mathematically, tanh never reaches exactly 1 — it only approaches it asymptotically. But computers use 64-bit floating point (float64), which has a smallest representable positive number of ≈ 5×10^-324. Since e^-12039 ≈ 10^-5228 is far below that threshold, the computer stores it as 0. In practice, tanh(x) = 1.0 exactly for any x ≥ ~19.

Python verification:

import numpy as np

print(f"tanh(12039) = {np.tanh(12039)}")        # 1.0
print(f"e^-12039    = {np.exp(-12039)}")         # 0.0 (underflow)
print(f"1 - tanh(5) = {1 - np.tanh(5):.2e}")    # 1.81e-04
print(f"1 - tanh(10)= {1 - np.tanh(10):.2e}")   # 8.27e-09
print(f"1 - tanh(20)= {1 - np.tanh(20):.2e}")   # 0.00e+00 ← exact 1.0

Python APIs for tanh

Library Usage GPU Autodiff Best for
math.tanh(x) scalar Simple calculations
np.tanh(arr) ndarray Data preprocessing
torch.tanh(t) Tensor PyTorch model training
tf.math.tanh(t) Tensor TensorFlow model training

ReLU (Rectified Linear Unit)

Definition:

$$\text{ReLU}(x) = \max(0, x)$$

Properties:

Property Value
Output range [0, ∞)
Center Not zero-centered
Monotonic Yes
Derivative 0 if x < 0, 1 if x > 0
Saturates at Never (for positive x)
Computation Very cheap (comparison only)

Why ReLU solves the vanishing gradient problem:

With tanh, gradients are always in (0, 1). During backpropagation, gradients get multiplied through layers — they shrink exponentially:

tanh:  Layer 3 gradient = 0.2 × 0.3 × 0.1 = 0.006  (almost zero!)
ReLU:  Layer 3 gradient = 1   × 1   × 1   = 1      (full signal!)

Deep layers with tanh receive near-zero gradients and barely learn. ReLU passes gradients as either 0 or 1 — no shrinking — so all layers learn effectively.

ReLU variants:

Variant Formula Fixes dead neurons?
ReLU max(0, x)
LeakyReLU max(0.01x, x) ✅ Small negative slope
ELU x if x > 0, α(eˣ-1) if x ≤ 0 ✅ Smooth negative region
GELU x · Φ(x) ✅ Used in Transformers

Dead neuron problem

If a ReLU neuron's input is always negative, its output is always 0 and it stops learning permanently. Use LeakyReLU(0.01) or ELU if this occurs.

tanh vs ReLU: When to Use Each

Criterion tanh ReLU
Output range [-1, 1] (bounded) [0, ∞) (unbounded)
Gradient flow Shrinks through layers Passes through unchanged
Vanishing gradient? Yes — deep layers starve No — gradient = 0 or 1
Training speed Slower (exp computation) ~6× faster (comparison)
Dead neurons? No Possible (use LeakyReLU)
Best for LSTM/GRU gates, output [-1,1] Hidden layers, regression, CNNs
Network depth Shallow (1–2 layers) Deep (3+ layers)

Practical example — ozone prediction model:

Setting tanh ReLU
Converged at epoch 131 294
Best val_mae 8.02 7.75 (−3.4%)
Best val_loss 105.99 99.98 (−5.7%)
Train MAE 8.98 8.55 (−4.8%)

ReLU trained longer because gradients kept flowing to deeper layers, allowing the model to keep learning past tanh's ceiling.


Data Preprocessing for Time Series

Scalers from sklearn.preprocessing

Scaler Method Good for LSTM? Why
MinMaxScaler Scales to [0,1] range ✅ Best choice LSTM sigmoid/tanh activations work best with bounded inputs
StandardScaler Zero mean, unit variance ⚠️ Use with caution Assumes stationary data; most time series are non-stationary
RobustScaler Uses median & IQR ⚠️ Situational Good for outliers/spikes, but unbounded output can cause issues
MaxAbsScaler Scales by max absolute value ⚠️ Situational Preserves sparsity and sign; rare fit for time series

Best practice

Always fit() on training data only, then transform() on both train and test. Fitting on the full dataset causes data leakage.

Recommended pattern:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_data)    # fit on train only
test_scaled = scaler.transform(test_data)           # transform test

# After prediction
predictions = scaler.inverse_transform(pred_scaled)

Watch out for non-stationarity

If test data has values outside the training range, MinMaxScaler will produce values outside [0,1]. Consider differencing the series first, using a rolling window scaler, or clipping outliers.


LSTM Input Shape

Why LSTM Requires 3D Input: (samples, timesteps, features)

Dimension Meaning Example (stock prediction)
samples Number of independent sequences (batch size) 1000 different 30-day windows
timesteps Length of each sequence (lookback window) 30 days per window
features Variables measured at each timestep price, volume, RSI = 3 features

At each timestep, the LSTM cell receives one slice of features and combines it with memory from previous steps:

Timestep:    t=0         t=1         t=2        ...    t=29
              │           │           │                  │
Input:    [p,v,rsi]   [p,v,rsi]   [p,v,rsi]         [p,v,rsi]
              ▼           ▼           ▼                  ▼
           ┌──────┐   ┌──────┐   ┌──────┐          ┌──────┐
    h₀ ──▶ │ LSTM │──▶│ LSTM │──▶│ LSTM │──▶ ... ──│ LSTM │──▶ output
           └──────┘   └──────┘   └──────┘          └──────┘

Reshaping 2D data for LSTM:

lookback = 30
X = []
for i in range(lookback, len(data)):
    X.append(data[i - lookback:i])  # slice 30 rows

X = np.array(X)  # shape: (970, 30, 3)

input_shape in the first LSTM layer

Only needs (timesteps, features) — Keras infers the samples dimension automatically from the batch.

LSTM Constructor Parameters (Keras)

tf.keras.layers.LSTM(
    units,                    # hidden state size (required)
    activation='tanh',        # activation for cell state
    recurrent_activation='sigmoid',  # activation for gates
    return_sequences=False,   # True = output at every timestep
    return_state=False,       # True = also return h and c
    dropout=0.0,              # input dropout rate
    recurrent_dropout=0.0,    # recurrent state dropout rate
    input_shape=(timesteps, features),  # only needed on first layer
)
Parameter Default Meaning
units Number of hidden units (memory width). Larger = more capacity
activation 'tanh' Applied to candidate cell state and output
recurrent_activation 'sigmoid' Applied to forget, input, and output gates
return_sequences False False = output only last timestep. True = output all timesteps (needed when stacking LSTM layers)
return_state False If True, returns (output, hidden_state, cell_state)
dropout 0.0 Fraction of input units to drop (regularization)
recurrent_dropout 0.0 Fraction of recurrent units to drop
input_shape (timesteps, features) — only on the first layer; Keras infers batch size

PyTorch equivalent

PyTorch's nn.LSTM(input_size, hidden_size, num_layers, batch_first, dropout, bidirectional) uses different parameter names but the same concepts. Key difference: set batch_first=True to match Keras' default (batch, timesteps, features) ordering.

Reshaping Data for LSTM: X.reshape(-1, timesteps, features)

LSTM requires 3D input. Raw tabular data is 2D, so reshaping is needed:

# Before: 2D array from lag construction
X.shape = (350, 9)          # 350 samples, 9 columns

# After: 3D array for LSTM
X = X.reshape(-1, 9, 1)     # univariate: 9 timesteps, 1 feature
X.shape = (350, 9, 1)

What each parameter means:

Parameter Value Meaning
-1 inferred "Keep all samples" — NumPy calculates this from total elements ÷ (timesteps × features)
timesteps lags + 1 Number of time steps in the lookback window
features 1 or n Number of variables measured at each timestep

Univariate vs multivariate:

# Univariate (ozone only): 1 feature per timestep
X = X.reshape(-1, lags + 1, 1)

# Multivariate (ozone + 4 weather features): 5 features per timestep
X = X.reshape(-1, lags + 1, 5)

Silent reshape bug

Using reshape(-1, 45, 1) instead of reshape(-1, 9, 5) produces the same total elements — NumPy won't complain. But the LSTM sees 45 timesteps of 1 feature instead of 9 timesteps of 5 features, mixing up temporal ordering and producing wrong results silently.

Reshaping y — the target:

y = y.reshape(-1, 1)    # 2D, not 3D — y is not LSTM input

y only needs 2D (samples, output_dim) because it's compared against the Dense output layer, not fed through the LSTM. The 1 means one prediction per sample (single target variable).


Dense Output Layer

What Dense Does

A Dense (fully connected) layer connects every input neuron to every output neuron with learnable weights:

$$y = \text{activation}(W \cdot x + b)$$

In LSTM models, Dense is the final layer that maps the LSTM's hidden state to the prediction:

LSTM hidden state (32 units) ──▶ Dense(1) ──▶ single prediction
     h = [h₁, h₂, ..., h₃₂]       W·h + b       ŷ (scalar)

Parameters

tf.keras.layers.Dense(
    units,                # number of output neurons (required)
    activation=None,      # output activation function
    use_bias=True,        # add bias term b
)
Parameter Meaning Typical value for regression
units Output dimension — how many values to predict 1 (single target)
activation Transform applied to output 'linear' or None (no squashing for regression)
use_bias Whether to add bias b True (default)

Choosing units and activation

Task units activation Why
Single-value regression (e.g., ozone) 1 'linear' Unbounded continuous output
Multi-target regression n_targets 'linear' One output per target
Binary classification 1 'sigmoid' Output ∈ [0, 1] as probability
Multi-class classification n_classes 'softmax' Probabilities summing to 1

How units Connects to y.reshape(-1, 1)

The Dense output shape must match the target shape:

# Target: predict one value (ozone)
y = y.reshape(-1, 1)                    # shape: (samples, 1)
model.add(Dense(units=1))               # output: (samples, 1) ✓

# Target: predict two values (ozone + temperature)
y = y.reshape(-1, 2)                    # shape: (samples, 2)
model.add(Dense(units=2))               # output: (samples, 2) ✓

If units doesn't match y.shape[1], Keras raises a shape mismatch error during training.

Why activation='linear' for regression

Using 'sigmoid' or 'tanh' would cap predictions to [0,1] or [-1,1], preventing the model from predicting values outside that range. For regression, the output must be unbounded — the scaler handles range mapping.


Building an LSTM Model (Keras Sequential API)

model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(50, return_sequences=True, input_shape=(timesteps, features)),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1),
])

model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
predictions = model.predict(X_test)
API Use case
Sequential Linear stack of layers (most LSTM time series models)
Functional Multi-input/output, shared layers, skip connections
Subclassing Full custom forward pass logic

Training Workflow: Scale → Train → Predict → Inverse Transform

The model lives in scaled space. Only come back to original space for evaluation.

Raw data ──▶ scaler.fit_transform() ──▶ Scaled data ──▶ Train LSTM
Raw predictions ◀── scaler.inverse_transform() ◀── Scaled predictions
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_data)   # fit + transform
test_scaled = scaler.transform(test_data)          # transform only

model.fit(X_train_scaled, y_train_scaled, epochs=50)
pred_scaled = model.predict(X_test_scaled)

# Inverse transform only for evaluation/display
predictions = scaler.inverse_transform(pred_scaled)
rmse = np.sqrt(np.mean((predictions - actual_values) ** 2))

Why you must train on scaled data

  • LSTM activations (sigmoid, tanh) have bounded ranges — raw values like 12039 saturate these functions, killing gradients
  • Large input values → exploding/vanishing gradients
  • Features on different scales cause the optimizer to zigzag
  • MSE loss on raw values would be dominated by large-scale features

Evaluation Metrics

Metric Functions

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def metric(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    print(f"MAE={mae:.2f} RMSE={rmse:.2f} MAPE={mape:.1f}% R²={r2:.4f}")

What Each Metric Means

Metric Meaning Poor Acceptable Good
MAE Average absolute error in original units Lower is better
RMSE Penalizes large errors more than MAE Close to MAE = consistent
MAPE Percentage error relative to true values > 20% 10–20% < 10%
Variance explained by model (1.0 = perfect) < 0.5 0.5–0.8 > 0.8

RMSE vs MAE gap

If RMSE >> MAE, the model has some predictions with very large errors (spikes). Investigate those outliers.

MAPE with small values

MAPE is inflated when true values are near zero. Use symmetric MAPE (sMAPE) instead:

smape = np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 100

Improving Underperforming Models (R² < 0.5)

Issue Fix
Underfitting Increase LSTM units, add layers, train more epochs
Insufficient features Add day_of_week, hour_of_day, lag features, rolling stats
Too short lookback Experiment with longer timesteps (48h, 72h, 168h)
No stationarity handling Difference the series or add trend features
Slow convergence Use ReduceLROnPlateau and EarlyStopping callbacks
# Improved model with dropout and callbacks
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(timesteps, features)),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1)
])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5),
]
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=100, callbacks=callbacks, validation_split=0.1)

Visualization

Plotting Predictions with Forecast Zone

nr_datapoints = 24 * 7                        # last 7 days
nr_datapoints_addon = nr_datapoints + 12       # + 12h forecast

y_true = y_val.flatten()[-nr_datapoints:]      # 168 points
y_pred = y_val_pred.flatten()[-nr_datapoints_addon:]  # 180 points

x_true = np.arange(0, nr_datapoints)
x_pred = np.arange(nr_datapoints_addon - len(y_pred), nr_datapoints_addon)

fig, ax = plt.subplots(figsize=(15, 5))
sns.lineplot(x=x_true, y=y_true, color=colors[0], label="true", ax=ax)
sns.lineplot(x=x_pred, y=y_pred, color=colors[1], label="predicted", ax=ax)

# Forecast zone
ax.axvline(x=nr_datapoints, color='gray', linestyle='--', alpha=0.5, label="forecast start")
ax.axvspan(nr_datapoints, nr_datapoints_addon, alpha=0.1, color='orange', label="forecast zone")

# Formatting
ax.set_xticks(np.arange(0, nr_datapoints_addon + 1, 24))
ax.legend(fontsize=14, loc='lower center')
ax.grid(alpha=0.3)

Training Convergence & Diagnostics

Reading Training Logs

A typical Keras training log shows four key values per epoch:

Metric Meaning
loss Training loss (MSE for regression)
mae Training mean absolute error
val_loss Validation loss
val_mae Validation MAE

Convergence Phases

Phase Behavior Action
Rapid drop Loss drops dramatically (e.g. 3090 → 157 in 3 epochs) Normal — initial learning
Steady descent Consistent improvement per epoch Let it train
Plateau Loss changes < 0.1 per epoch Stop or reduce LR
Divergence Loss increases LR too high, reduce it

Detecting Plateau

When loss improvement per epoch becomes negligible (< 0.1), the model has converged at its current capacity. Additional epochs waste compute.

# Example: plateau detection
# Epoch 220: val_loss=98.18, val_mae=7.73
# Epoch 246: val_loss=98.18, val_mae=7.73
# → 26 epochs, ~0 improvement → plateau confirmed

Overfitting vs Healthy Training

Signal Diagnosis
train_loss << val_loss (growing gap) Overfitting — add Dropout, reduce capacity, or get more data
train_loss ≈ val_loss Healthy — model generalizes well
val_loss < train_loss Normal with Dropout — Dropout is off during validation

Essential Callbacks

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,        # halve LR when stuck
        patience=10,        # wait 10 epochs before reducing
        min_lr=1e-6,
        verbose=1
    ),
    EarlyStopping(
        monitor='val_loss',
        patience=30,        # stop after 30 epochs without improvement
        restore_best_weights=True
    )
]
  • ReduceLROnPlateau — smaller steps can push past a plateau, often squeezing 5–15% more improvement
  • EarlyStopping — prevents wasted epochs; restore_best_weights=True ensures the best model is kept

Increasing Model Capacity

When the model plateaus and is not overfitting, a larger architecture may help:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

model = Sequential([
    Dense(128, activation='relu', input_shape=(n_features,)),
    BatchNormalization(),
    Dropout(0.2),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1)
])
  • BatchNormalization — stabilizes training, allows higher learning rates
  • Dropout — prevents overfitting from added capacity

Data requirements for larger models

More parameters need more data to generalize. But if the current model shows no overfitting (train ≈ val), there is headroom to increase capacity safely. Watch for val_loss diverging from train_loss after scaling up.

Interpreting Combined Metrics

Example from an ozone prediction model:

Stopped at epoch: 208 (EarlyStopping)
Best val_loss: 97.77    →  √97.77 = 9.89 (= RMSE ✓)
Best val_mae:  7.71
R² = 0.574              →  model explains 57% of variance
MAPE = 18.2%            →  average prediction off by ~18%

Cross-checking metrics:

Check Formula Confirms
val_loss → RMSE √val_loss = RMSE MSE and RMSE are consistent
RMSE vs MAE gap RMSE ≈ MAE → consistent errors Large gap → outlier predictions
R² interpretation 0.57 = moderate 43% variance unexplained

When to Change Strategy

Current R² Next Step
< 0.3 Check data quality, feature relevance, preprocessing
0.3–0.6 Feature engineering, try tree-based models (XGBoost/LightGBM)
0.6–0.8 Hyperparameter tuning, ensemble methods
> 0.8 Fine-tune, focus on edge cases

Tree-based models for tabular data

For tabular regression with R² < 0.7, gradient boosting (XGBoost, LightGBM) often outperforms neural networks with less tuning effort. NNs shine on sequential, image, and text data.