Chapter 09: Advanced Transfer Learning Options#
Chapters 07 and 08 introduced the two main transfer learning methods in vangja: tune_method="parametric" and tune_method="prior_from_idata". Both methods take the posterior from a source model and use it — either summarized or in full — as the prior for a target model.
In practice, however, the default behavior may not always be optimal. Seasonal peaks may not align perfectly between source and target series. The transferred priors may allow the short series model to overfit by developing seasonal effects with unrealistically large amplitude. Or the default summary statistics (mean and standard deviation) may not be the best description of a skewed posterior.
Vangja exposes several advanced parameters that give practitioners fine-grained control over the transfer learning process. This chapter documents these options:
Bidirectional changepoints via `delta_side` — interpreting the slope parameter from the right end of the time range so that short series can inform it directly
Regularization via `loss_factor_for_tune` — preventing seasonal amplitude blow-up and trend drift
Phase alignment via `shift_for_tune` — learning a time shift to align seasonal peaks
Custom summary statistics via `override_...` parameters — using the mode or other statistics instead of the mean
Note: These features are more experimental than the core `parametric` and `prior_from_idata` methods. They were introduced during the research behind the paper “Long Horizons from Short Histories: A Bayesian Transfer Learning Framework for Forecasting Time Series” (Krajevski & Tojtovska Ribarski, 2026) and have shown promising results on stock market data. However, they add hyperparameters that require careful tuning. Use them with caution and always validate on held-out data.
Note: This chapter is primarily documentation — it explains the concepts and provides code snippets for reference rather than running end-to-end examples. The ideas are most easily explored by adapting the transfer learning notebooks (Chapters 07–08) with the parameters described here.
1. Bidirectional Changepoints with delta_side#
The Problem: The Slope Parameter Represents the Wrong End of the Series#
In Facebook Prophet and TimeSeers, the piecewise linear trend is defined so that the slope parameter \(w\) represents the slope at the earliest time point. The changepoint deltas \(\delta_1, \delta_2, \ldots, \delta_S\) are then accumulated left to right — each one adjusts the slope at a later segment of the time series. The indicator matrix \(\mathbf{A}\) tells the model whether a point in time occurred after each changepoint:
\[g(t) = \left(w + \mathbf{A}(t)^{\top} \boldsymbol{\delta}\right) t + \left(m + \mathbf{A}(t)^{\top} \boldsymbol{\gamma}\right), \qquad \mathbf{A}(t) = \left[a_1(t), \ldots, a_S(t)\right]^{\top}\]
where \(a_j(t) = \mathbf{1}[t > s_j]\) — i.e., the indicator activates for time points after changepoint \(s_j\).
This is perfectly fine for single-series forecasting. But it creates a serious problem when fitting multiple time series jointly with hierarchical or transfer learning, where a long source series (\(C\)) and several short target series (\(X_i\)) are combined into one model.
Consider the typical setup: you have a long time series spanning 2012–2017 and several short series covering only the last 3 months of 2017. All series share the same normalized time axis \(t \in [0, 1]\), with changepoints distributed across this range. With the default left-to-right formulation:
The slope parameter \(w\) represents the trend at \(\min(T_C)\) — the start of the long series (e.g., 2012)
The short series only occupy a small segment near \(\max(T_C)\) (e.g., mid-2017 to end-2017)
The short series have no data near \(\min(T_C)\), so they cannot directly inform \(w\)
When using hierarchical modeling with partial pooling on the slope, the global distribution \(w_0\) is influenced almost exclusively by the long series, since only the long series has observations near the start of the time range. The short series can influence the slope only indirectly, through their effect on the changepoint deltas — and even then, only if the deltas are modeled with complete or partial pooling.
The Solution: Interpreting Changepoints Right to Left#
Vangja introduces the delta_side parameter on LinearTrend, which controls the direction in which the slope parameter and changepoints are interpreted:
`delta_side="left"` (default, same as Prophet): \(w\) is the slope at the earliest time point. The indicator activates for \(t > s_j\) (time points after each changepoint).
`delta_side="right"`: \(w\) is the slope at the latest time point. The indicator activates for \(t \leq s_j\) (time points before each changepoint).
Mathematically, the only difference is in the indicator matrix. With delta_side="right":
\[a_j(t) = \mathbf{1}[t \leq s_j]\]
instead of \(a_j(t) = \mathbf{1}[t > s_j]\).
This reversal means the changepoint deltas now accumulate from right to left. The slope parameter \(w\) represents the trend at the end of the time range — precisely where the short target series have data. In a hierarchical model, this ensures that the global slope distribution \(w_0\) is informed by all series, not just the long one.
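The two indicator conventions are easy to see in a small NumPy sketch (variable names here are illustrative, not vangja internals):

```python
import numpy as np

t = np.linspace(0, 1, 6)           # normalized time axis
s = np.array([0.25, 0.5, 0.75])    # changepoint locations

# delta_side="left": indicators activate *after* each changepoint,
# so deltas accumulate left to right and w is the slope at t = 0.
A_left = (t[:, None] > s[None, :]).astype(float)

# delta_side="right": indicators activate *before* each changepoint,
# so deltas accumulate right to left and w is the slope at t = 1.
A_right = (t[:, None] <= s[None, :]).astype(float)

# In both cases the effective slope at time t is w + A(t) @ delta;
# only the point at which w itself applies changes.
w, delta = 1.0, np.array([0.2, -0.1, 0.3])
slope_left = w + A_left @ delta    # equals w at the first time point
slope_right = w + A_right @ delta  # equals w at the last time point
```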
Why This Matters for Transfer Learning#
When fitting short and long series jointly:
| | `delta_side="left"` (default) | `delta_side="right"` |
|---|---|---|
| \(w\) represents slope at | \(\min(T_C)\) — start of long series | \(\max(T_C)\) — end of combined range |
| Short series influence on \(w\) | Indirect only (via changepoints) | Direct (short series have data here) |
| Hierarchical \(w_0\) informed by | Primarily the long series | All series |
This is especially important when:
Short series have a different recent trend than the long series’ historical trend
You are using pool_type="partial" and want the shared slope to reflect the current regime, not a historical one
The forecast horizon extends beyond \(\max(T_C)\), making the end-of-range slope the most relevant for extrapolation
Usage#
from vangja import LinearTrend, FourierSeasonality
# Source model on long series — use right-to-left changepoints
source_model = (
LinearTrend(n_changepoints=25, delta_side="right")
+ FourierSeasonality(365.25, 10)
)
source_model.fit(long_data, method="mapx")
# Target model on short series — set delta_side="right"
target_model = (
LinearTrend(
n_changepoints=25,
delta_side="right",
tune_method="parametric",
)
+ FourierSeasonality(365.25, 10, tune_method="parametric")
)
target_model.fit(
short_data,
method="mapx",
idata=source_model.trace,
t_scale_params=source_model.t_scale_params,
)
When used with hierarchical modeling over multiple short series fitted jointly with the long series:
# Joint hierarchical model with right-to-left changepoints
model = (
LinearTrend(
n_changepoints=25,
delta_side="right",
pool_type="partial",
shrinkage_strength=10,
)
+ FourierSeasonality(365.25, 10, pool_type="partial")
)
model.fit(combined_data, method="mapx")
Interaction with Transfer Learning#
When transferring the slope parameter with delta_side="right", Vangja automatically accounts for the reversed direction during the extraction of posterior statistics. Specifically, when computing the slope mean from the source model’s posterior, the sum of the changepoint deltas is added back to the slope to recover the effective end-of-range slope:
\[w_{\text{end}} = w + \sum_{j=1}^{S} \delta_j\]
This correction ensures that the transferred slope prior correctly reflects the slope at the end of the time range, regardless of which delta_side was used in the source model.
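A minimal sketch of this correction (assuming hypothetical posterior means from a source model fitted with the default left-to-right deltas):

```python
import numpy as np

# Hypothetical posterior means from a source model fitted with the
# default delta_side="left", where w is the slope at the *start*
# of the time range.
slope_mean = 0.8
delta_means = np.array([0.1, -0.3, 0.05])

# Adding the accumulated deltas recovers the effective slope at the
# *end* of the range: the quantity a delta_side="right" target model
# expects as the location of its slope prior.
end_of_range_slope = slope_mean + delta_means.sum()
```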
When to Use#
Use delta_side="right" when:
Fitting multiple series jointly where short series only cover the recent end of the time range
The forecast horizon extends forward in time, making the recent slope more relevant than the historical slope
Using hierarchical modeling and wanting the global slope to be informed by all series
Use the default delta_side="left" when:
Fitting a single series (no advantage to either direction)
All series cover the same time range (no asymmetry in where data exists)
Working with the original Prophet formulation for compatibility
2. Regularization with loss_factor_for_tune#
The Problem: Overfitting Short Series#
When transferring seasonality from a long time series to a short one, the short series model can still overfit. Even with informed priors, the optimizer (MAP or MCMC) may push the Fourier coefficients to values that produce excessively large seasonal effects — fitting noise in the short training window rather than the true seasonal pattern.
Similarly, when transferring the trend slope, the short series model may drift the slope away from the value learned on the long series, especially if the short training window happens to coincide with an atypical period.
The Solution: PyMC Potentials as Soft Constraints#
Vangja addresses this with the loss_factor_for_tune parameter, available on both FourierSeasonality and LinearTrend. Under the hood, this adds a PyMC Potential — an arbitrary term added to the log-posterior — that penalizes deviations from what was learned on the source model.
How It Works for FourierSeasonality#
For seasonal components, the regularization prevents the model from learning seasonal effects with greater amplitude than those observed in the source model. The mechanism is:
Create a set of time points \(T_j\) spanning one full period \(p_j\) (e.g., 365 days for yearly seasonality)
Compute the seasonal effects using the source model’s posterior mean coefficients: \(\mathbf{fs}_{\text{source}} = \mathbf{F} \cdot \boldsymbol{\beta}_{\text{source}}^{MAP}\)
Compute the seasonal effects using the target model’s current coefficients: \(\mathbf{fs}_{\text{target}} = \mathbf{F} \cdot \boldsymbol{\beta}_{\text{target}}\)
Add a penalty that activates only when the target’s seasonal amplitude exceeds the source’s:
\[\text{penalty} = \phi_{\boldsymbol{\theta}} \cdot \lambda \cdot \min\left(0, \|\mathbf{fs}_{\text{source}}\|_2^2 - \|\mathbf{fs}_{\text{target}}\|_2^2\right)\]where:
\(\phi_{\boldsymbol{\theta}}\) is the loss_factor_for_tune hyperparameter
\(\lambda\) is an automatic scaling factor: \(\lambda = \frac{2 \cdot p}{n}\) if the period \(p\) is longer than twice the number of training data points \(n\), and \(0\) otherwise. This means the regularization only activates for seasonal components whose period is too long to be reliably estimated from the short series alone.
Key insight: the penalty is one-sided. Smaller seasonal effects than the source are allowed (or even encouraged), but larger effects are penalized. This makes physical sense: if the source model learned a yearly amplitude of \(\pm 20°F\) from temperature data, we don’t want the bike sales model to develop a yearly seasonality with amplitude exceeding what the data supports.
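This mechanism can be sketched as a plain NumPy function (illustrative only; `amplitude_penalty` and its signature are assumptions — in vangja the term is a PyMC Potential inside the model):

```python
import numpy as np

def amplitude_penalty(fs_source, fs_target, loss_factor, period, n_train):
    """One-sided amplitude potential, sketched after the description above.

    The returned value is *added* to the log-posterior: it is zero while
    the target's seasonal effect is no larger than the source's, and
    negative (a penalty) once it grows beyond it.
    """
    # Automatic scaling: active only when the period is too long to be
    # estimated from the short series alone (condition as stated above).
    lam = 2 * period / n_train if period > 2 * n_train else 0.0
    gap = np.sum(fs_source**2) - np.sum(fs_target**2)
    return loss_factor * lam * min(0.0, gap)
```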
How It Works for LinearTrend#
For the trend slope, the regularization is simpler — a squared deviation penalty:
\[\text{penalty} = -\phi_{\mathbf{w}} \cdot \left(\mathbf{w} - \mathbf{w}_{\text{source}}^{MAP}\right)^2\]
This keeps the target model’s slope close to what was learned on the source model. The larger \(\phi_{\mathbf{w}}\), the more tightly the slope is constrained.
Interestingly, negative values of \(\phi_{\mathbf{w}}\) can also be useful: they encourage the slope to deviate from the source, which may be appropriate when the source and target series have opposing trends.
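The slope potential can be sketched in a few lines (the sign convention here is an assumption: the term is added to the log-posterior, so a positive factor penalizes deviation and a negative factor rewards it):

```python
def slope_penalty(w_target: float, w_source_map: float, loss_factor: float) -> float:
    """Squared-deviation potential added to the log-posterior.

    Positive loss_factor pulls the target slope toward the source
    value; negative loss_factor pushes it away.
    """
    return -loss_factor * (w_target - w_source_map) ** 2
```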
Usage#
from vangja import LinearTrend, FourierSeasonality
model = (
LinearTrend(
tune_method="parametric",
loss_factor_for_tune=1, # Regularize the slope transfer
)
+ FourierSeasonality(
period=365.25,
series_order=6,
tune_method="parametric",
loss_factor_for_tune=1, # Regularize yearly seasonality amplitude
)
)
model.fit(short_data, method="mapx", idata=source_trace, t_scale_params=source_t_scale_params)
Guidance from the Paper#
In the experiments from the paper, the best combined model (hierarchical + transfer learning) achieved its best results without regularization (\(\phi = 0\)). The regularization potentials had a noticeable effect when using transfer learning without hierarchical modeling, where loss_factor_for_tune=1 improved results.
The takeaway: regularization is most useful when there is no hierarchical structure providing its own regularization via shrinkage. When combining transfer learning with partial pooling, the shrinkage mechanism already prevents overfitting, making the potentials less necessary.
3. Phase Alignment with shift_for_tune#
The Problem: Misaligned Seasonal Peaks#
Transfer learning assumes that the seasonal pattern from the source series has the same phase (timing of peaks and troughs) as the target series. But this is not always the case:
Temperature peaks in mid-July, but ice cream sales might peak in early August (lagged demand)
A stock index’s yearly seasonality might be shifted by a few weeks compared to an individual stock
Monthly billing cycles may cause seasonal peaks to shift between different business units
When the seasonal peaks are misaligned, directly transferring Fourier coefficients produces a seasonal pattern that is correct in shape but wrong in timing.
The Solution: Learning a Shift Parameter#
The shift_for_tune parameter on FourierSeasonality tells vangja to introduce a learnable time shift (in days) when computing the Fourier basis functions. Instead of:
\[\mathbf{F}(t) = \left[\cos\left(\frac{2 \pi k t}{p}\right), \sin\left(\frac{2 \pi k t}{p}\right)\right]_{k=1}^{N}\]
the model computes:
\[\mathbf{F}(t) = \left[\cos\left(\frac{2 \pi k (t + \Delta t)}{p}\right), \sin\left(\frac{2 \pi k (t + \Delta t)}{p}\right)\right]_{k=1}^{N}\]
where \(\Delta t\) is a new parameter that the model learns during fitting. This allows the transferred seasonal shape to slide along the time axis to find the best alignment with the target data.
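The effect of the shift on the basis can be sketched in NumPy (in vangja itself \(\Delta t\) is a learnable PyMC parameter rather than a fixed argument; `fourier_basis` is an illustrative helper, not the library's API):

```python
import numpy as np

def fourier_basis(t: np.ndarray, period: float, order: int, shift: float = 0.0) -> np.ndarray:
    """Fourier features with an optional phase shift, in the same
    units as t (here: days). shift=0.0 recovers the standard basis."""
    k = np.arange(1, order + 1)
    x = 2 * np.pi * k * (t[:, None] + shift) / period
    return np.concatenate([np.sin(x), np.cos(x)], axis=1)

t = np.arange(0, 365, dtype=float)
base = fourier_basis(t, 365.25, 3)                 # standard basis
shifted = fourier_basis(t, 365.25, 3, shift=14.0)  # two-week phase shift
```

Shifting by a full period recovers the original basis, which is why very large learned shifts are a warning sign rather than extra flexibility.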
Usage#
from vangja import FlatTrend, FourierSeasonality
model = (
FlatTrend()
+ FourierSeasonality(
period=365.25,
series_order=6,
tune_method="parametric",
shift_for_tune=True, # Learn a phase shift
)
)
model.fit(short_data, method="mapx", idata=source_trace, t_scale_params=source_t_scale_params)
After fitting, the learned shift is stored in the model’s trace under the key fs_{idx} - shift and is automatically applied during prediction.
When to Use#
Use shift_for_tune=True when:
You suspect the source and target series have similar seasonal shape but different timing
The seasonal peaks in the target data consistently appear shifted relative to the source
You have enough data in the short series to estimate a single shift parameter reliably
Avoid it when:
The seasonal patterns are truly identical in phase (adding an unnecessary parameter)
The short series is extremely short (fewer data points than the number of other parameters)
You are using prior_from_idata, which already captures the full covariance structure
4. Custom Summary Statistics with override_... Parameters#
The Problem: The Mean Is Not Always the Best Summary#
When using tune_method="parametric", vangja extracts the mean and standard deviation of each parameter’s posterior from the source model. These become the location and scale of the new Normal prior:
\[\theta_{\text{target}} \sim \mathcal{N}\left(\text{mean}(\theta_{\text{source}}),\; \text{sd}(\theta_{\text{source}})\right)\]
But the posterior is almost certainly not Gaussian. It may be skewed, heavy-tailed, or multimodal. In such cases, the posterior mean might not even be a point of high probability density. A better choice could be the mode (the MAP estimate) — the single most probable value:
\[\hat{\theta}_{\text{mode}} = \arg\max_{\theta}\; p(\theta \mid \mathbf{y}_{\text{source}})\]
The paper showed that centering priors around the mode rather than the mean can improve results, particularly for the trend slope, because the mode corresponds to the region of highest posterior probability.
The Override Parameters#
Both FourierSeasonality and LinearTrend expose override_... parameters that let you inject custom values for the prior location and scale:
`FourierSeasonality`:
override_beta_mean_for_tune: Replace the posterior mean of the Fourier coefficients with custom values (e.g., the posterior mode)
override_beta_sd_for_tune: Replace the posterior standard deviation with custom values
`LinearTrend`:
override_slope_mean_for_tune: Replace the posterior mean of the slope
override_slope_sd_for_tune: Replace the posterior standard deviation of the slope
override_delta_loc_for_tune: Replace the posterior mean (location) of the changepoint deltas
override_delta_scale_for_tune: Replace the posterior scale of the changepoint deltas
Example: Using the Posterior Mode#
import numpy as np
from scipy import stats
# Fit the source model first
source_model.fit(long_data, method="nuts")
# Extract the posterior samples for the slope
slope_samples = source_model.trace["posterior"]["lt_0 - slope"].values.flatten()
# Compute the mode using kernel density estimation
kde = stats.gaussian_kde(slope_samples)
x_grid = np.linspace(slope_samples.min(), slope_samples.max(), 1000)
slope_mode = x_grid[np.argmax(kde(x_grid))]
# Compute the standard deviation (still use the full posterior for spread)
slope_std = slope_samples.std()
# Create the target model with overridden statistics
from vangja import LinearTrend, FourierSeasonality
target_model = (
LinearTrend(
tune_method="parametric",
override_slope_mean_for_tune=slope_mode, # Use mode instead of mean
override_slope_sd_for_tune=slope_std,
)
+ FourierSeasonality(period=365.25, series_order=6, tune_method="parametric")
)
target_model.fit(
short_data,
method="mapx",
idata=source_model.trace,
t_scale_params=source_model.t_scale_params,
)
When the Mode Differs from the Mean#
Consider a posterior that is left-skewed (long tail toward lower values). The mean is pulled toward the tail, while the mode sits at the peak of the distribution. Centering the new prior around the mode gives the target model a stronger starting point — it begins at the most probable parameter value rather than a tail-influenced average.
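The gap between mode and mean is easy to demonstrate on a synthetic skewed sample (a right-skewed one here; the direction of the skew is immaterial to the argument):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "posterior": the mean is dragged into the
# long tail, while the KDE mode stays at the bulk of the mass.
samples = rng.lognormal(mean=0.0, sigma=0.75, size=20_000)

kde = stats.gaussian_kde(samples)
grid = np.linspace(samples.min(), samples.max(), 2000)
mode = grid[np.argmax(kde(grid))]
mean = samples.mean()
# mode sits well below mean: a prior centered on the mode starts the
# target model in the region of highest posterior density.
```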
This distinction is especially relevant for:
Trend slope: Where the posterior may be skewed due to changepoint interactions
Changepoint deltas: Where the Laplace prior produces heavy-tailed posteriors with modes near zero
Using the Override for Different Fourier Coefficients#
You can also override coefficients individually. For example, to compute a KDE-based mode for each Fourier coefficient and pass the resulting vector as the prior location:
beta_samples = source_model.trace["posterior"]["fs_0 - beta(p=365.25,n=6)"].values.reshape(-1, 12)
# Compute the KDE mode for each coefficient on its own grid
beta_modes = np.array([
    x_grid[np.argmax(stats.gaussian_kde(beta_samples[:, i])(x_grid))]
    for i, x_grid in enumerate(
        np.linspace(beta_samples.min(axis=0), beta_samples.max(axis=0), 1000).T
    )
])
model = FourierSeasonality(
period=365.25,
series_order=6,
tune_method="parametric",
override_beta_mean_for_tune=beta_modes,
)
Summary and Caveats#
The four advanced features discussed in this chapter provide fine-grained control over the transfer learning process:
| Feature | Parameter | Purpose | Adds Hyperparameters? |
|---|---|---|---|
| Bidirectional changepoints | `delta_side` | Ensure slope parameter is informed by all series | No (structural choice) |
| Regularization | `loss_factor_for_tune` | Prevent seasonal amplitude blow-up and trend drift | Yes (\(\phi\)) |
| Phase alignment | `shift_for_tune` | Align seasonal peaks between source and target | Yes (\(\Delta t\) learned) |
| Custom statistics | `override_...` | Use mode or other statistics instead of mean | No (replaces defaults) |
Experimental Status#
These features should be considered experimental. While they were validated in the paper’s experiments on stock market data (443 stocks, 730 time windows), they introduce additional complexity and hyperparameters:
`delta_side="right"` changes the structural interpretation of the trend. It does not add hyperparameters, but it does change which parameter represents the slope and how changepoint deltas are accumulated. Always use it consistently — mixing delta_side values between source and target models requires care.
`loss_factor_for_tune` adds a hyperparameter (\(\phi\)) that must be tuned. The paper found that only 0 and 1 were reliably useful values, and that the feature was less necessary when hierarchical modeling with partial pooling was also used.
`shift_for_tune` adds a learned parameter, which increases model complexity. On very short series, this additional degree of freedom may not be identifiable.
`override_...` parameters require the user to manually compute alternative statistics (like the mode via KDE), adding workflow complexity.
Practical Recommendations#
Start simple: Use tune_method="parametric" or tune_method="prior_from_idata" with default settings first. These are well-tested and work well in most scenarios.
Add regularization if you observe the target model producing unrealistically large seasonal effects or trend slopes that diverge from the source. Try loss_factor_for_tune=1 as a starting point.
Try the mode if the source model’s posterior is visibly skewed (check with az.plot_posterior()). This is a low-risk change that often helps.
Use `shift_for_tune` only if domain knowledge suggests a phase mismatch. Validate by checking whether the learned shift is physically reasonable (e.g., a 2-week shift makes sense, a 6-month shift suggests a deeper modeling issue).
Prefer hierarchical modeling over manual regularization when possible. Partial pooling with pool_type="partial" provides a principled form of regularization that adapts to the data.
Use `delta_side="right"` when fitting short and long series jointly. This is a structural improvement rather than an added hyperparameter, and is especially valuable with hierarchical modeling.
Further Reading#
Krajevski & Tojtovska Ribarski (2026): Long Horizons from Short Histories — The full paper with experimental results on stock market data
What’s Next#
Chapter 10 ties everything together by demonstrating a complete Bayesian workflow — prior predictive checks, convergence diagnostics, posterior predictive checks, model comparison, sensitivity analysis, and uncertainty quantification — applied to the transfer learning scenario from Chapters 07–08.