Deep Learning Fundamentals¶
Activation Functions¶
tanh (Hyperbolic Tangent)¶
Definition:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
Properties:
| Property | Value |
|---|---|
| Output range | (-1, 1) |
| Center | 0 (zero-centered) |
| Monotonic | Yes, strictly increasing |
| Derivative | 1 − tanh²(x) |
| Max derivative | 1 (at x = 0) |
| Saturates at | ±1 for large \|x\| |
Relationship to sigmoid:
$$\tanh(x) = 2\sigma(2x) - 1, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
Role in deep learning:
- Zero-centered output: activations are balanced around 0, so gradients don't pick up a systematic positive/negative bias, which typically speeds up convergence compared to sigmoid
- LSTM core component: used in cell state updates, where the candidate cell state (C̃) passes through tanh to keep values in (-1, 1)
- Stronger gradients: the derivative peaks at 1 (vs 0.25 for sigmoid), so gradients flow better during backpropagation
- Still suffers vanishing gradients: for |x| > 3 the derivative is below 0.01, so saturated units pass almost no gradient. LSTM/GRU gating mitigates the related problem of gradients vanishing across time steps by giving the cell state an additive update path
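The derivative claims in this list can be checked numerically. The following is a minimal sketch using NumPy; the helper names `sigmoid`, `tanh_grad`, and `sigmoid_grad` are illustrative and not part of any library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([0.0, 1.0, 3.0, 5.0])
for x, tg, sg in zip(xs, tanh_grad(xs), sigmoid_grad(xs)):
    print(f"x={x:>4.1f}  tanh'={tg:.4f}  sigmoid'={sg:.4f}")
```

Both derivatives peak at x = 0 (1.0 for tanh, 0.25 for sigmoid) and both collapse toward 0 once |x| reaches 3 or more, which is the saturation behaviour described above.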
Where tanh is used in LSTM:
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c) ← candidate cell state
h_t = o_t * tanh(C_t) ← output hidden state
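To show where the two tanh calls sit inside a full time step, here is a minimal NumPy sketch of one LSTM step. It assumes the standard LSTM formulation (forget, input, and output gates with sigmoid activations); the function name `lstm_step`, the stacked weight layout, and the random weights are hypothetical choices for illustration, not how Keras or PyTorch store their parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x_t] to four stacked pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b       # shape: (4 * hidden,)
    f_t = sigmoid(z[0 * hidden:1 * hidden])         # forget gate
    i_t = sigmoid(z[1 * hidden:2 * hidden])         # input gate
    o_t = sigmoid(z[2 * hidden:3 * hidden])         # output gate
    c_tilde = np.tanh(z[3 * hidden:4 * hidden])     # candidate cell state (first tanh)
    c_t = f_t * c_prev + i_t * c_tilde              # additive cell state update
    h_t = o_t * np.tanh(c_t)                        # output hidden state (second tanh)
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, features = 4, 3
W = rng.normal(scale=0.5, size=(4 * hidden, hidden + features))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):          # 5 timesteps of 3 features
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Because h_t is a sigmoid gate times a tanh, every entry of the hidden state stays strictly inside (-1, 1), while the cell state c_t itself is not bounded.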
When to use tanh vs alternatives:
| Use case | Recommendation |
|---|---|
| LSTM/GRU internals | tanh (by design) |
| Hidden layers (general) | ReLU or GELU preferred (no saturation) |
| Output layer, range [-1,1] | tanh |
| Output layer, range [0,1] | sigmoid |
| Classification output | softmax / sigmoid |
Worked Example: tanh(12039) → 1.0¶
Step 1 — Plug into the formula:
$$\tanh(12039) = \frac{e^{12039} - e^{-12039}}{e^{12039} + e^{-12039}}$$
Step 2 — Compute the exponentials:
| Term | Value |
|---|---|
| e^12039 | ≈ 10^5228 (a number with over 5000 digits) |
| e^-12039 | ≈ 10^-5228 (over 5000 zeros after the decimal point) |
Step 3 — Simplify:
Numerator = e^12039 - e^-12039 ≈ 10^5228 (subtracting ~0 changes nothing)
Denominator = e^12039 + e^-12039 ≈ 10^5228 (adding ~0 changes nothing)
tanh(12039) = 10^5228 / 10^5228 = 1.0
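Step 4 — The general intuition for large x:
Dividing numerator and denominator by e^x gives
$$\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$
As x grows, e^{-2x} → 0, so tanh(x) → 1 (and by symmetry tanh(x) → -1 as x → -∞).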
Step 5 — Convergence table:
| x | e^x | e^-x | Numerator | Denominator | tanh(x) |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 2 | 0.0000 |
| 1 | 2.718 | 0.368 | 2.350 | 3.086 | 0.7616 |
| 2 | 7.389 | 0.135 | 7.254 | 7.524 | 0.9640 |
| 3 | 20.086 | 0.050 | 20.036 | 20.135 | 0.9951 |
| 5 | 148.41 | 0.0067 | 148.41 | 148.42 | 0.9999 |
| 10 | 22026 | 0.0000454 | 22026 | 22026 | 0.9999999959 |
| 12039 | 10^5228 | ≈ 0 | 10^5228 | 10^5228 | 1.0 |
Why the computer says exactly 1.0
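Mathematically, tanh never reaches exactly 1; it only approaches it asymptotically. Computers use 64-bit floating point (float64), whose smallest representable positive number is ≈ 5×10^-324, so e^-12039 ≈ 10^-5228 is stored as 0. (e^12039 likewise exceeds the float64 maximum of ≈ 1.8×10^308, so libraries do not evaluate the textbook formula directly.) The result rounds to 1.0 long before underflow matters, though: 1 − tanh(x) ≈ 2e^-2x, and once that gap falls below roughly 10^-16, the spacing of float64 values near 1, the nearest representable value is 1.0. In practice, tanh(x) = 1.0 exactly for any x ≥ ~19.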
Python verification:
import numpy as np
print(f"tanh(12039) = {np.tanh(12039)}") # 1.0
print(f"e^-12039 = {np.exp(-12039)}") # 0.0 (underflow)
print(f"1 - tanh(5) = {1 - np.tanh(5):.2e}") # 9.08e-05
print(f"1 - tanh(10)= {1 - np.tanh(10):.2e}") # 4.12e-09
print(f"1 - tanh(20)= {1 - np.tanh(20):.2e}") # 0.00e+00 ← exact 1.0
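The textbook formula cannot be evaluated directly at x = 12039, because e^12039 overflows float64. The sketch below contrasts a naive transcription of the definition with a rearranged form that only ever computes e^(-2|x|). The names `naive_tanh` and `stable_tanh` are illustrative; this is not a claim about how NumPy implements `np.tanh` internally.

```python
import numpy as np

def naive_tanh(x):
    # Direct transcription of the definition; e^x overflows to inf for x > ~709,
    # which turns the quotient into inf / inf = nan.
    with np.errstate(over="ignore", invalid="ignore"):
        ex, enx = np.exp(x), np.exp(-x)
        return (ex - enx) / (ex + enx)

def stable_tanh(x):
    # Uses only e^(-2|x|), which lies in (0, 1]: it can underflow to 0 but never overflow.
    x = np.asarray(x, dtype=np.float64)
    t = np.exp(-2.0 * np.abs(x))
    return np.sign(x) * (1.0 - t) / (1.0 + t)

print(naive_tanh(12039.0))    # nan
print(stable_tanh(12039.0))   # 1.0
```

For moderate inputs the two versions agree with `np.tanh`; they only diverge where the naive form overflows.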
Python APIs for tanh¶
| Call | Input type | GPU | Autodiff | Best for |
|---|---|---|---|---|
| `math.tanh(x)` | scalar | ❌ | ❌ | Simple calculations |
| `np.tanh(arr)` | ndarray | ❌ | ❌ | Data preprocessing |
| `torch.tanh(t)` | Tensor | ✅ | ✅ | PyTorch model training |
| `tf.math.tanh(t)` | Tensor | ✅ | ✅ | TensorFlow model training |
ReLU (Rectified Linear Unit)¶
Definition:
$$\text{ReLU}(x) = \max(0, x)$$
Properties:
| Property | Value |
|---|---|
| Output range | [0, ∞) |
| Center | Not zero-centered |
| Monotonic | Yes |
| Derivative | 0 if x < 0, 1 if x > 0 |
| Saturates at | Never (for positive x) |
| Computation | Very cheap (comparison only) |
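Why ReLU mitigates the vanishing gradient problem:
With tanh, each local derivative lies in (0, 1] and equals 1 only at x = 0. During backpropagation these factors are multiplied through the layers, so the product shrinks roughly exponentially with depth:
tanh: Layer 3 gradient = 0.2 × 0.3 × 0.1 = 0.006 (almost zero!)
ReLU: Layer 3 gradient = 1 × 1 × 1 = 1 (full signal!)
Layers far from the output receive near-zero gradients with tanh and barely learn. ReLU's local derivative is exactly 1 for every active unit (and 0 for inactive ones), so the activation itself does not shrink the gradient along active paths. Weight matrices also scale the gradient; the numbers above isolate the activation's contribution.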
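The multiplication argument can be reproduced with a few lines of NumPy. This sketch follows one path through eight layers and multiplies only the activation derivatives; the pre-activation values in `z` are made up for illustration, and weight matrices are deliberately left out.

```python
import numpy as np

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    # 1 where the unit is active, 0 otherwise
    return (z > 0).astype(float)

# Hypothetical pre-activations of one unit per layer along a single path.
z = np.array([1.2, 1.0, 1.5, 0.8, 1.3, 1.1, 0.9, 1.4])

tanh_chain = np.cumprod(tanh_grad(z))   # running product, layer by layer
relu_chain = np.cumprod(relu_grad(z))

print("tanh:", tanh_chain[-1])
print("ReLU:", relu_chain[-1])
```

With these values the tanh product falls below 0.001 after eight layers, while the ReLU product stays at 1. A single negative pre-activation would set the ReLU product to 0 instead, which is the dead-path case rather than gradual shrinkage.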
ReLU variants:
| Variant | Formula | Fixes dead neurons? |
|---|---|---|
| ReLU | max(0, x) | ❌ |
| LeakyReLU | max(0.01x, x) | ✅ Small negative slope |
| ELU | x if x > 0, α(eˣ-1) if x ≤ 0 | ✅ Smooth negative region |
| GELU | x · Φ(x) | ✅ Used in Transformers |
Dead neuron problem
If a ReLU neuron's input is always negative, its output is always 0 and it stops learning permanently. Use LeakyReLU(0.01) or ELU if this occurs.
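The variants in the table are short enough to write out directly. Below is a minimal NumPy sketch; the function names and the default values `slope=0.01` and `alpha=1.0` are illustrative choices, and GELU uses the exact x · Φ(x) form via `math.erf` rather than the tanh approximation some frameworks offer.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Equivalent to max(slope * x, x) for 0 < slope < 1
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # np.minimum avoids evaluating exp() on large positive x that np.where discards anyway
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
    return x * phi

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for name, fn in [("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu), ("gelu", gelu)]:
    print(f"{name:>10}: {np.round(fn(x), 4)}")
```

Only plain ReLU maps every negative input to exactly 0; the other three keep a nonzero response for negative x, which is what lets gradient reach a unit whose inputs are mostly negative.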
tanh vs ReLU: When to Use Each¶
| Criterion | tanh | ReLU |
|---|---|---|
| Output range | (-1, 1) (bounded) | [0, ∞) (unbounded) |
| Gradient flow | Shrinks through layers | Passes through unchanged |
| Vanishing gradient? | Yes — deep layers starve | No — gradient = 0 or 1 |
| Training speed | Slower (exp computation) | ~6× faster (comparison) |
| Dead neurons? | No | Possible (use LeakyReLU) |
| Best for | LSTM/GRU gates, output [-1,1] | Hidden layers, regression, CNNs |
| Network depth | Shallow (1–2 layers) | Deep (3+ layers) |
Practical example — ozone prediction model:
| Setting | tanh | ReLU |
|---|---|---|
| Converged at epoch | 131 | 294 |
| Best val_mae | 8.02 | 7.75 (−3.4%) |
| Best val_loss | 105.99 | 99.98 (−5.7%) |
| Train MAE | 8.98 | 8.55 (−4.8%) |
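ReLU trained for more epochs before stopping and reached lower error. A plausible explanation is that gradients kept flowing to the layers far from the output, letting the model keep improving past the point where tanh plateaued.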
Data Preprocessing for Time Series¶
Scalers from sklearn.preprocessing¶
| Scaler | Method | Good for LSTM? | Why |
|---|---|---|---|
| MinMaxScaler | Scales to [0,1] range | ✅ Best choice | LSTM sigmoid/tanh activations work best with bounded inputs |
| StandardScaler | Zero mean, unit variance | ⚠️ Use with caution | Assumes stationary data; most time series are non-stationary |
| RobustScaler | Uses median & IQR | ⚠️ Situational | Good for outliers/spikes, but unbounded output can cause issues |
| MaxAbsScaler | Scales by max absolute value | ⚠️ Situational | Preserves sparsity and sign; rare fit for time series |
Best practice
Always fit() on training data only, then transform() on both train and test. Fitting on the full dataset causes data leakage.
Recommended pattern:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_data) # fit on train only
test_scaled = scaler.transform(test_data) # transform test
# After prediction
predictions = scaler.inverse_transform(pred_scaled)
Watch out for non-stationarity
If test data has values outside the training range, MinMaxScaler will produce values outside [0,1]. Consider differencing the series first, using a rolling window scaler, or clipping outliers.
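The out-of-range effect is easy to reproduce. The sketch below fits `MinMaxScaler` on a small made-up training column and transforms test values that fall outside the training range; it also shows the scaler's `clip=True` option (available in scikit-learn 0.24 and later) as one way to apply the clipping suggestion.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0], [40.0]])
test = np.array([[5.0], [25.0], [55.0]])    # 5 and 55 lie outside the training range

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)                           # statistics come from train only
test_scaled = scaler.transform(test)
print(test_scaled.ravel())                  # first value below 0, last value above 1

# clip=True clamps transformed values to feature_range
clipped = MinMaxScaler(feature_range=(0, 1), clip=True).fit(train).transform(test)
print(clipped.ravel())
```

Clipping keeps the inputs bounded but discards how far outside the range a value was, so it is a trade-off rather than a fix for non-stationarity.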
LSTM Input Shape¶
Why LSTM Requires 3D Input: (samples, timesteps, features)¶
| Dimension | Meaning | Example (stock prediction) |
|---|---|---|
| samples | Number of sequences (windows) in the dataset; Keras splits them into batches | 1000 different 30-day windows |
| timesteps | Length of each sequence (lookback window) | 30 days per window |
| features | Variables measured at each timestep | price, volume, RSI = 3 features |
At each timestep, the LSTM cell receives one slice of features and combines it with memory from previous steps:
Timestep: t=0 t=1 t=2 ... t=29
│ │ │ │
Input: [p,v,rsi] [p,v,rsi] [p,v,rsi] [p,v,rsi]
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
h₀ ──▶ │ LSTM │──▶│ LSTM │──▶│ LSTM │──▶ ... ──│ LSTM │──▶ output
└──────┘ └──────┘ └──────┘ └──────┘
Reshaping 2D data for LSTM:
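import numpy as np
lookback = 30
X = []
# data: 2D array of shape (rows, features), e.g. (1000, 3)
for i in range(lookback, len(data)):
    X.append(data[i - lookback:i])  # slice the 30 rows before row i
X = np.array(X)  # shape: (970, 30, 3) for 1000 input rows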
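The same loop can be wrapped in a helper that also builds the matching targets. This is a sketch under stated assumptions: the name `make_windows`, the `target_col` parameter, and the choice to predict the row immediately after each window are illustrative and not taken from the original pipeline.

```python
import numpy as np

def make_windows(data, lookback, target_col=0):
    """Turn a (rows, features) array into LSTM inputs X and next-step targets y."""
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i - lookback:i])      # the `lookback` rows before row i
        y.append(data[i, target_col])       # value to predict: row i, target column
    return np.array(X), np.array(y).reshape(-1, 1)

# Synthetic stand-in for 1000 rows of (price, volume, RSI)
data = np.arange(1000 * 3, dtype=float).reshape(1000, 3)
X, y = make_windows(data, lookback=30)
print(X.shape, y.shape)   # (970, 30, 3) (970, 1)
```

The first `lookback` rows have no complete history, which is why 1000 rows yield 970 windows.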
input_shape in the first LSTM layer
Only needs (timesteps, features) — Keras infers the samples dimension automatically from the batch.
LSTM Constructor Parameters (Keras)¶
tf.keras.layers.LSTM(
    units,                              # hidden state size (required)
    activation='tanh',                  # activation for cell state
    recurrent_activation='sigmoid',     # activation for gates
    return_sequences=False,             # True = output at every timestep
    return_state=False,                 # True = also return h and c
    dropout=0.0,                        # input dropout rate
    recurrent_dropout=0.0,              # recurrent state dropout rate
    input_shape=(timesteps, features),  # only needed on first layer
)
| Parameter | Default | Meaning |
|---|---|---|
| `units` | (required) | Number of hidden units (memory width). Larger = more capacity |
| `activation` | `'tanh'` | Applied to candidate cell state and output |
| `recurrent_activation` | `'sigmoid'` | Applied to forget, input, and output gates |
| `return_sequences` | `False` | `False` = output only last timestep. `True` = output all timesteps (needed when stacking LSTM layers) |
| `return_state` | `False` | If `True`, returns (output, hidden_state, cell_state) |
| `dropout` | 0.0 | Fraction of input units to drop (regularization) |
| `recurrent_dropout` | 0.0 | Fraction of recurrent units to drop |
| `input_shape` | (none) | (timesteps, features), only on the first layer; Keras infers batch size |
PyTorch equivalent
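PyTorch's nn.LSTM(input_size, hidden_size, num_layers, batch_first, dropout, bidirectional) uses different parameter names but the same concepts. Key difference: nn.LSTM expects (timesteps, batch, features) by default, so set batch_first=True to match Keras' (batch, timesteps, features) ordering. Its dropout argument also behaves differently: it is applied between stacked layers, so it has no effect when num_layers=1.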
Reshaping Data for LSTM: X.reshape(-1, timesteps, features)¶
LSTM requires 3D input. Raw tabular data is 2D, so reshaping is needed:
# Before: 2D array from lag construction
print(X.shape) # (350, 9): 350 samples, 9 columns
# After: 3D array for LSTM
X = X.reshape(-1, 9, 1) # univariate: 9 timesteps, 1 feature
print(X.shape) # (350, 9, 1)
What each parameter means:
| Parameter | Value | Meaning |
|---|---|---|
| `-1` | inferred | "Keep all samples": NumPy calculates this from total elements ÷ (timesteps × features) |
| `timesteps` | `lags + 1` | Number of time steps in the lookback window |
| `features` | `1` or `n` | Number of variables measured at each timestep |
Univariate vs multivariate:
# Univariate (ozone only): 1 feature per timestep
X = X.reshape(-1, lags + 1, 1)
# Multivariate (ozone + 4 weather features): 5 features per timestep
X = X.reshape(-1, lags + 1, 5)
Silent reshape bug
Using reshape(-1, 45, 1) instead of reshape(-1, 9, 5) produces the same total elements — NumPy won't complain. But the LSTM sees 45 timesteps of 1 feature instead of 9 timesteps of 5 features, mixing up temporal ordering and producing wrong results silently.
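A small array with recognizable values makes the bug visible. In this sketch each entry is encoded as 10·t + f (timestep t, feature f), which is a made-up encoding used only so the layout can be read off the output; it assumes the flat row stores all features of timestep 0 first, then timestep 1, and so on.

```python
import numpy as np

timesteps, features = 9, 5
# One flat row of 45 values: 10*t + f for timestep t and feature f
row = np.array([[10 * t + f for t in range(timesteps) for f in range(features)]])

good = row.reshape(-1, timesteps, features)       # (1, 9, 5)
bad = row.reshape(-1, timesteps * features, 1)    # (1, 45, 1): no error raised

print(good[0, 2])       # [20 21 22 23 24] -> timestep 2, all five features
print(bad[0, :5, 0])    # [0 1 2 3 4]      -> five features of t=0 posing as five timesteps

# A cheap guard before training
assert good.shape[1:] == (timesteps, features)
```

An explicit shape assertion like the last line costs nothing and catches the mistake before the model sees the data.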
Reshaping y (the target):
y = y.reshape(-1, 1) # shape: (samples, 1)
y only needs 2D (samples, output_dim) because it's compared against the Dense output layer, not fed through the LSTM. The 1 means one prediction per sample (single target variable).
Dense Output Layer¶
What Dense Does¶
A Dense (fully connected) layer connects every input neuron to every output neuron with learnable weights:
$$y = \text{activation}(W \cdot x + b)$$
In LSTM models, Dense is the final layer that maps the LSTM's hidden state to the prediction:
LSTM hidden state (32 units) ──▶ Dense(1) ──▶ single prediction
h = [h₁, h₂, ..., h₃₂] W·h + b ŷ (scalar)
Parameters¶
tf.keras.layers.Dense(
    units,              # number of output neurons (required)
    activation=None,    # output activation function
    use_bias=True,      # add bias term b
)
| Parameter | Meaning | Typical value for regression |
|---|---|---|
| `units` | Output dimension: how many values to predict | `1` (single target) |
| `activation` | Transform applied to output | `'linear'` or `None` (no squashing for regression) |
| `use_bias` | Whether to add bias `b` | `True` (default) |
Choosing units and activation¶
| Task | `units` | `activation` | Why |
|---|---|---|---|
| Single-value regression (e.g., ozone) | `1` | `'linear'` | Unbounded continuous output |
| Multi-target regression | `n_targets` | `'linear'` | One output per target |
| Binary classification | `1` | `'sigmoid'` | Output ∈ [0, 1] as probability |
| Multi-class classification | `n_classes` | `'softmax'` | Probabilities summing to 1 |
How units Connects to y.reshape(-1, 1)¶
The Dense output shape must match the target shape:
# Target: predict one value (ozone)
y = y.reshape(-1, 1) # shape: (samples, 1)
model.add(Dense(units=1)) # output: (samples, 1) ✓
# Target: predict two values (ozone + temperature)
y = y.reshape(-1, 2) # shape: (samples, 2)
model.add(Dense(units=2)) # output: (samples, 2) ✓
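If units doesn't match y.shape[1], Keras either raises a shape mismatch error during training or, when the two shapes happen to be broadcast-compatible, computes the loss over the broadcast result without complaint. Comparing model.output_shape with y.shape before calling fit() catches both cases.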
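A mismatch between prediction and target shapes does not always fail loudly. The sketch below uses plain NumPy, not Keras, to show how broadcasting can turn a shape mistake into a wrong number instead of an error; whether a particular Keras version raises or broadcasts for a given pair of shapes is not verified here.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])             # shape (4,): target was never reshaped
y_pred = np.array([[1.0], [2.0], [3.0], [4.0]])     # shape (4, 1): like a Dense(1) output

diff = y_pred - y_true                # broadcasts to (4, 4) instead of failing
print(diff.shape)                     # (4, 4)
print(np.mean(diff ** 2))             # 2.5, although every prediction is exact

fixed = y_pred - y_true.reshape(-1, 1)
print(np.mean(fixed ** 2))            # 0.0
```

Reshaping the target to (samples, 1) up front, as the section recommends, removes the ambiguity.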
Why activation='linear' for regression
Using 'sigmoid' or 'tanh' would cap predictions to [0,1] or [-1,1], preventing the model from predicting values outside that range. For regression, the output must be unbounded — the scaler handles range mapping.
Building an LSTM Model (Keras Sequential API)¶
import tensorflow as tf
# timesteps, features: the last two dimensions of X_train
model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(50, return_sequences=True, input_shape=(timesteps, features)),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
predictions = model.predict(X_test)
| API | Use case |
|---|---|
| `Sequential` | Linear stack of layers (most LSTM time series models) |
| `Functional` | Multi-input/output, shared layers, skip connections |
| `Subclassing` | Full custom forward pass logic |
Training Workflow: Scale → Train → Predict → Inverse Transform¶
The model lives in scaled space. Only come back to original space for evaluation.
Raw data ──▶ scaler.fit_transform() ──▶ Scaled data ──▶ Train LSTM
│
▼
Raw predictions ◀── scaler.inverse_transform() ◀── Scaled predictions
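import numpy as np
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_data) # fit + transform
test_scaled = scaler.transform(test_data) # transform only
# X_*_scaled / y_*_scaled: lookback windows and targets built from the scaled arrays
model.fit(X_train_scaled, y_train_scaled, epochs=50)
pred_scaled = model.predict(X_test_scaled)
# Inverse transform only for evaluation/display.
# inverse_transform expects the same number of columns the scaler was fitted on.
predictions = scaler.inverse_transform(pred_scaled)
rmse = np.sqrt(np.mean((predictions - actual_values) ** 2))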
Why you must train on scaled data
- LSTM activations (sigmoid, tanh) have bounded ranges — raw values like 12039 saturate these functions, killing gradients
- Large input values → exploding/vanishing gradients
- Features on different scales cause the optimizer to zigzag
- MSE loss on raw values would be dominated by large-scale features
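One practical wrinkle in the scale, train, predict, inverse-transform workflow: a scaler fitted on several input columns cannot directly invert a single-column prediction. A common workaround, sketched below, is to keep a second scaler for the target alone. The data is random, the names `x_scaler` and `y_scaler` are illustrative, and the "prediction" is a stand-in for `model.predict(...)` so the example runs without a trained model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Hypothetical data: column 0 is the target, columns 1-2 are extra features.
train = rng.uniform(0, 100, size=(80, 3))
test = rng.uniform(0, 100, size=(20, 3))

x_scaler = MinMaxScaler().fit(train)            # all input columns
y_scaler = MinMaxScaler().fit(train[:, [0]])    # target column only

test_scaled = x_scaler.transform(test)
# Stand-in for model.predict(...): pretend the model reproduces the scaled target.
pred_scaled = test_scaled[:, [0]]               # shape (20, 1)

# Back to original units for evaluation
predictions = y_scaler.inverse_transform(pred_scaled)
rmse = np.sqrt(np.mean((predictions - test[:, [0]]) ** 2))
print(predictions.shape, rmse)
```

Both scalers are fitted on training data only, so the leakage rule from the preprocessing section still holds.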
Evaluation Metrics¶
Metric Functions¶
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def metric(y_true, y_pred):
    """Print MAE, RMSE, MAPE and R² for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # MAPE divides by y_true, so it is undefined when y_true contains zeros
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    print(f"MAE={mae:.2f} RMSE={rmse:.2f} MAPE={mape:.1f}% R²={r2:.4f}")
What Each Metric Means¶
| Metric | Meaning | Poor | Acceptable | Good |
|---|---|---|---|---|
| MAE | Average absolute error in original units | — | — | Lower is better |
| RMSE | Penalizes large errors more than MAE | — | — | Close to MAE = consistent |
| MAPE | Percentage error relative to true values | > 20% | 10–20% | < 10% |
| R² | Variance explained by model (1.0 = perfect) | < 0.5 | 0.5–0.8 | > 0.8 |
RMSE vs MAE gap
If RMSE >> MAE, the model has some predictions with very large errors (spikes). Investigate those outliers.
MAPE with small values
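MAPE is inflated when true values are near zero. Use symmetric MAPE (sMAPE) instead; one common definition is:
$$\text{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}$$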
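To see the inflation concretely, the sketch below compares MAPE with one common sMAPE variant (absolute error divided by the mean of |y_true| and |y_pred|) on three made-up points, one of which has a true value near zero. Several sMAPE definitions exist; the `eps` guard against a zero denominator is an added assumption.

```python
import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape(y_true, y_pred, eps=1e-8):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # eps only matters when both the true and predicted value are 0
    return np.mean(np.abs(y_true - y_pred) / np.maximum(denom, eps)) * 100

y_true = [0.1, 50.0, 100.0]
y_pred = [1.0, 52.0, 98.0]
print(f"MAPE  = {mape(y_true, y_pred):.1f}%")
print(f"sMAPE = {smape(y_true, y_pred):.1f}%")
```

The single near-zero point pushes MAPE above 300%, while this sMAPE variant is bounded at 200% per point and stays far lower here.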
Improving Underperforming Models (R² < 0.5)¶
| Issue | Fix |
|---|---|
| Underfitting | Increase LSTM units, add layers, train more epochs |
| Insufficient features | Add day_of_week, hour_of_day, lag features, rolling stats |
| Too short lookback | Experiment with longer timesteps (48h, 72h, 168h) |
| No stationarity handling | Difference the series or add trend features |
| Slow convergence | Use ReduceLROnPlateau and EarlyStopping callbacks |
# Improved model with dropout and callbacks
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(timesteps, features)),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1)
])
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5),
]
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=100, callbacks=callbacks, validation_split=0.1)
Visualization¶
Plotting Predictions with Forecast Zone¶
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
colors = sns.color_palette() # any two distinguishable colors work
# Assumes y_val_pred extends 12 steps past the end of y_val (the forecast horizon)
nr_datapoints = 24 * 7 # last 7 days
nr_datapoints_addon = nr_datapoints + 12 # + 12h forecast
y_true = y_val.flatten()[-nr_datapoints:] # 168 points
y_pred = y_val_pred.flatten()[-nr_datapoints_addon:] # 180 points
x_true = np.arange(0, nr_datapoints)
x_pred = np.arange(nr_datapoints_addon - len(y_pred), nr_datapoints_addon)
fig, ax = plt.subplots(figsize=(15, 5))
sns.lineplot(x=x_true, y=y_true, color=colors[0], label="true", ax=ax)
sns.lineplot(x=x_pred, y=y_pred, color=colors[1], label="predicted", ax=ax)
# Forecast zone
ax.axvline(x=nr_datapoints, color='gray', linestyle='--', alpha=0.5, label="forecast start")
ax.axvspan(nr_datapoints, nr_datapoints_addon, alpha=0.1, color='orange', label="forecast zone")
# Formatting
ax.set_xticks(np.arange(0, nr_datapoints_addon + 1, 24))
ax.legend(fontsize=14, loc='lower center')
ax.grid(alpha=0.3)
Training Convergence & Diagnostics¶
Reading Training Logs¶
A typical Keras training log shows four key values per epoch:
| Metric | Meaning |
|---|---|
| `loss` | Training loss (MSE for regression) |
| `mae` | Training mean absolute error |
| `val_loss` | Validation loss |
| `val_mae` | Validation MAE |
Convergence Phases¶
| Phase | Behavior | Action |
|---|---|---|
| Rapid drop | Loss drops dramatically (e.g. 3090 → 157 in 3 epochs) | Normal — initial learning |
| Steady descent | Consistent improvement per epoch | Let it train |
| Plateau | Loss changes < 0.1 per epoch | Stop or reduce LR |
| Divergence | Loss increases | LR too high, reduce it |
Detecting Plateau¶
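When loss improvement per epoch becomes negligible (< 0.1 at the loss scale of this example, where val_loss is around 100), the model has converged at its current capacity. Additional epochs waste compute. For losses on a different scale, a relative threshold is the safer criterion.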
# Example: plateau detection
# Epoch 220: val_loss=98.18, val_mae=7.73
# Epoch 246: val_loss=98.18, val_mae=7.73
# → 26 epochs, ~0 improvement → plateau confirmed
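The same check can be automated on a recorded loss history. This is a minimal sketch: the function name `has_plateaued` and its `patience` and `min_delta` parameters are illustrative (they mirror the idea behind the Keras callbacks but are not Keras API), and the history values are made up to resemble the log above.

```python
def has_plateaued(val_losses, patience=10, min_delta=0.1):
    """True if the best loss of the last `patience` epochs improved on the
    best earlier loss by less than `min_delta`."""
    if len(val_losses) <= patience:
        return False                       # not enough history to judge
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_before - best_recent < min_delta

# Rapid drop, steady descent, then 26 epochs stuck at 98.18
history = [3090.0, 900.0, 157.0, 120.0, 105.0, 99.0, 98.5, 98.3] + [98.18] * 26

print(has_plateaued(history))                  # True
print(has_plateaued(history[:6], patience=3))  # False: still improving quickly
```

The absolute `min_delta` of 0.1 matches the threshold used in this section; for losses on another scale it would need to be adjusted.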
Overfitting vs Healthy Training¶
| Signal | Diagnosis |
|---|---|
| train_loss << val_loss (growing gap) | Overfitting — add Dropout, reduce capacity, or get more data |
| train_loss ≈ val_loss | Healthy — model generalizes well |
| val_loss < train_loss | Normal with Dropout — Dropout is off during validation |
Essential Callbacks¶
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
callbacks = [
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,     # halve LR when stuck
        patience=10,    # wait 10 epochs before reducing
        min_lr=1e-6,
        verbose=1
    ),
    EarlyStopping(
        monitor='val_loss',
        patience=30,    # stop after 30 epochs without improvement
        restore_best_weights=True
    )
]
- ReduceLROnPlateau: smaller steps can push past a plateau, often squeezing 5–15% more improvement
- EarlyStopping: prevents wasted epochs; restore_best_weights=True ensures the best model is kept
Increasing Model Capacity¶
When the model plateaus and is not overfitting, a larger architecture may help:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
model = Sequential([
    Dense(128, activation='relu', input_shape=(n_features,)),
    BatchNormalization(),
    Dropout(0.2),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1)
])
- BatchNormalization — stabilizes training, allows higher learning rates
- Dropout — prevents overfitting from added capacity
Data requirements for larger models
More parameters need more data to generalize. But if the current model shows no overfitting (train ≈ val), there is headroom to increase capacity safely. Watch for val_loss diverging from train_loss after scaling up.
Interpreting Combined Metrics¶
Example from an ozone prediction model:
Stopped at epoch: 208 (EarlyStopping)
Best val_loss: 97.77 → √97.77 = 9.89 (= RMSE ✓)
Best val_mae: 7.71
R² = 0.574 → model explains 57% of variance
MAPE = 18.2% → average prediction off by ~18%
Cross-checking metrics:
| Check | Formula | Confirms |
|---|---|---|
| val_loss → RMSE | √val_loss = RMSE | MSE and RMSE are consistent |
| RMSE vs MAE gap | RMSE ≈ MAE → consistent errors | Large gap → outlier predictions |
| R² interpretation | 0.57 = moderate | 43% variance unexplained |
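The cross-checks in the table are plain arithmetic on the logged values and can be scripted. The sketch below uses the numbers from the example log; the comparison against √(π/2) ≈ 1.25 is an added rule of thumb (it is the RMSE/MAE ratio for zero-mean normally distributed errors), not something stated in the log.

```python
import math

val_loss = 97.77    # best validation MSE from the log above
val_mae = 7.71
r2 = 0.574

rmse = math.sqrt(val_loss)
ratio = rmse / val_mae
unexplained = 1.0 - r2
gaussian_ratio = math.sqrt(math.pi / 2.0)   # reference ratio for normal errors

print(f"RMSE = {rmse:.2f}")
print(f"RMSE / MAE = {ratio:.2f} (normal errors would give {gaussian_ratio:.2f})")
print(f"Unexplained variance = {unexplained:.1%}")
```

A ratio close to the reference value suggests the errors are not dominated by a few extreme outliers; a much larger ratio would point to the spike problem described under "RMSE vs MAE gap".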
When to Change Strategy¶
| Current R² | Next Step |
|---|---|
| < 0.3 | Check data quality, feature relevance, preprocessing |
| 0.3–0.6 | Feature engineering, try tree-based models (XGBoost/LightGBM) |
| 0.6–0.8 | Hyperparameter tuning, ensemble methods |
| > 0.8 | Fine-tune, focus on edge cases |
Tree-based models for tabular data
For tabular regression with R² < 0.7, gradient boosting (XGBoost, LightGBM) often outperforms neural networks with less tuning effort. NNs shine on sequential, image, and text data.