The time series module provides the base classes for building and composing
forecasting models. Models are built by combining components using arithmetic
operators.
This class provides the foundation for building time series models in vangja.
It handles data preprocessing, scaling, model fitting, and prediction. Model
components can be combined using arithmetic operators (+, *, **) to create
complex models.
>>> from vangja import LinearTrend, FourierSeasonality
>>> # Create an additive model
>>> model = LinearTrend() + FourierSeasonality(period=365.25, series_order=10)
>>> model.fit(data)
>>> predictions = model.predict(horizon=30)
>>> # Create a multiplicative model
>>> model = LinearTrend() ** FourierSeasonality(period=7, series_order=3)
>>> model.fit(data)
Notes
Subclasses should implement:
- definition: Add parameters to the PyMC model
- _get_initval: Provide initial values for parameters
- _predict_map: Predict using MAP estimates
- _predict_mcmc: Predict using MCMC samples
- _plot: Plot the component’s contribution
data (pd.DataFrame) – A pandas dataframe that must at least have columns ds (predictor), y
(target) and series (name of time series).
scaler (Scaler) – Whether to use maxabs or minmax scaling of the y (target).
scale_mode (ScaleMode) – Whether to scale each series individually or together.
t_scale_params (TScaleParams|None) – Whether to override scale parameters for ds (predictor).
sigma_sd (float) – The standard deviation of the Normal prior of y (target).
sigma_pool_type (PoolType) – Type of pooling for the sigma parameter that is performed when sampling.
sigma_shrinkage_strength (float) – Shrinkage between groups for the hierarchical modeling.
method (Method) – The Bayesian inference method to be used: either a point estimate (MAP), a
VI method (e.g. ADVI), or full Bayesian sampling (MCMC).
optimization_method (OptimizationMethod) – The optimization method to be used for MAP inference. See
scipy.optimize.minimize documentation for details.
maxiter (int) – The maximum number of iterations for the L-BFGS-B optimization algorithm
when using MAP inference.
n (int) – The number of iterations to be used for the VI methods.
samples (int) – Denotes the number of samples to be drawn from the posterior for MCMC and
VI methods.
chains (int) – Denotes the number of independent chains drawn from the posterior. Only
applicable to the MCMC methods.
nuts_sampler (NutsSampler) – The sampler for the NUTS method.
progressbar (bool) – Whether to show a progressbar while fitting the model.
idata (az.InferenceData|None) – Sample from a posterior. If it is not None, Vangja will use this to set the
parameters’ priors in the model. If idata is not None, each component from
the model should specify how idata should be used to set its parameters’
priors.
For MCMC/VI methods, uncertainty is derived from posterior samples.
Each posterior draw is propagated through the model to produce a
family of prediction trajectories, from which percentile-based
credible intervals are computed.
For MAP methods, uncertainty is estimated using a hybrid approach:
The fitted observation noise sigma provides a base noise level.
In-sample residuals are used to calibrate the noise estimate.
A forecast-distance scaling factor sqrt(1+h/n) widens
the intervals for predictions further from the training data,
reflecting increasing epistemic uncertainty.
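The MAP-based widening described above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not vangja's internal code; the helper name `map_intervals` and the Gaussian z-multiplier are assumptions.

```python
import numpy as np

def map_intervals(yhat, residuals, horizon_steps, z=1.96):
    """Residual-calibrated noise widened by distance from training data."""
    sigma = np.std(residuals)               # base noise level from in-sample residuals
    n = len(residuals)                      # number of training points
    h = np.asarray(horizon_steps, dtype=float)
    width = z * sigma * np.sqrt(1 + h / n)  # forecast-distance scaling factor
    return yhat - width, yhat + width
```

Intervals computed this way grow with the forecast step h, matching the sqrt(1 + h/n) factor above.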
Parameters:
horizon (int) – Number of future steps to forecast.
freq (FreqStr, default "D") – Frequency of forecast steps.
uncertainty_samples (int, default 200) – Number of posterior draws to use for interval estimation.
Only used for MCMC/VI methods.
interval_width (float, default 0.95) – Width of the prediction interval (e.g. 0.95 for a 95% interval).
Returns:
The future DataFrame with columns yhat_<group>,
yhat_lower_<group>, and yhat_upper_<group> for each group.
Return type:
pd.DataFrame
Notes
See uncertainty.md for a detailed description of the approaches.
y_true (pd.DataFrame|None) – A pandas dataframe containing the true values for the inference period that
must at least have columns ds (predictor), y (target) and series (name of
time series).
clip_to_data (bool) – If True, clip predictions to the date range of the training data
(and y_true if provided). This avoids plotting predictions for
periods before the target series’ start date, which can happen
when transfer learning shifts t_scale_params.
Generates simulated observations from the model’s priors before
conditioning on data, enabling visual and quantitative verification
that the chosen priors are scientifically plausible.
Parameters:
samples (int, default 500) – Number of samples to draw from the prior predictive.
Returns:
ArviZ InferenceData with prior and prior_predictive groups.
Return type:
az.InferenceData
Raises:
RuntimeError – If the model has not been fit yet (self.model does not exist).
Notes
The model must be fit first so that the PyMC model graph exists.
Calling this method does not alter the fitted posterior.
Sample from the posterior predictive distribution.
Generates replicated datasets from the posterior to assess goodness of
fit. Requires the model to have been fitted with an MCMC or VI
method so that self.trace is available.
Returns:
ArviZ InferenceData with a posterior_predictive group added.
This class serves as the foundation for composing multiple time series
components together. It provides common functionality for combining
two components (left and right) and propagating method calls to both.
Parameters:
left (TimeSeriesModel|int|float) – The left operand of the combination. Can be a model component
or a numeric constant.
right (TimeSeriesModel|int|float) – The right operand of the combination. Can be a model component
or a numeric constant.
Combines two components using addition: y = left + right.
This class is created when using the + operator between time series
components. The resulting model sums the contributions from both
components.
Parameters:
left (TimeSeriesModel|int|float) – The left operand of the addition.
right (TimeSeriesModel|int|float) – The right operand of the addition.
Examples
>>> from vangja import LinearTrend, FourierSeasonality
>>> # Create an additive model with trend + seasonality
>>> model = LinearTrend() + FourierSeasonality(period=365.25, series_order=10)
>>> print(model)
LT(n=25,r=0.8,tm=None) + FS(p=365.25,n=10,tm=None)
Combines two components using y = left * (1 + right).
This class is created when using the ** operator between time series
components. This follows the Prophet-style multiplicative seasonality
where the right component modulates the left component around its value.
This formulation is useful when the amplitude of seasonality scales
with the trend level (heteroscedastic seasonal patterns).
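A plain-numpy numeric illustration of the two multiplicative conventions, independent of vangja's operators:

```python
import numpy as np

trend = np.array([100.0, 110.0, 120.0])
seasonal = np.array([0.10, -0.05, 0.0])   # fractional seasonal effect

prophet_style = trend * (1 + seasonal)    # '**': seasonality modulates the trend level
simple_product = trend * seasonal         # '*': contributions multiply directly
```

With the `**` convention a seasonal value of 0 leaves the trend unchanged, whereas with `*` it zeroes the output, which is why `**` suits modulating effects and `*` suits scaling factors.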
Parameters:
left (TimeSeriesModel|int|float) – The base component (typically a trend).
right (TimeSeriesModel|int|float) – The multiplicative modifier (typically seasonality).
Examples
>>> from vangja import LinearTrend, FourierSeasonality
>>> # Create a model with multiplicative seasonality
>>> model = LinearTrend() ** FourierSeasonality(period=365.25, series_order=10)
>>> print(model)
LT(n=25,r=0.8,tm=None) * (1 + FS(p=365.25,n=10,tm=None))
Notes
The ** operator was chosen because * is used for simple
multiplication of components.
Combines two components using simple multiplication: y = left * right.
This class is created when using the * operator between time series
components. The resulting model multiplies the contributions from both
components directly.
This is useful for applying scaling factors or when components should
truly multiply (not modulate around 1).
Parameters:
left (TimeSeriesModel|int|float) – The left operand of the multiplication.
right (TimeSeriesModel|int|float) – The right operand of the multiplication.
Examples
>>> from vangja import LinearTrend, UniformConstant
>>> # Create a model with a scaling factor
>>> model = LinearTrend() * UniformConstant(lower=0.8, upper=1.2)
>>> print(model)
LT(n=25,r=0.8,tm=None) * UC(l=0.8,u=1.2,tm=None)
Components are the building blocks for time series models. They can be combined
using + (additive), * (simple multiplicative), or ** (Prophet-style
multiplicative).
A piecewise linear trend component with optional changepoints.
This component models the trend of a time series as a piecewise linear
function, following the Prophet approach. The trend can have multiple
changepoints where the slope is allowed to change.
The trend is defined as:
trend(t) = (k + a(t)^T * delta) * t + (m + a(t)^T * gamma)
where:
k is the base slope
m is the intercept
delta is a vector of slope changes at changepoints
a(t) is an indicator vector for changepoints before time t
gamma is computed to make the trend continuous
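The formula can be sketched directly in numpy. This is a hypothetical helper, not vangja's API; note that gamma_j = -s_j * delta_j is exactly what keeps the trend continuous at each changepoint s_j.

```python
import numpy as np

def piecewise_trend(t, k, m, changepoints, delta):
    t = np.asarray(t, dtype=float)
    A = (t[:, None] >= changepoints[None, :]).astype(float)  # indicator a(t)
    gamma = -changepoints * delta   # offsets that keep the trend continuous
    return (k + A @ delta) * t + (m + A @ gamma)

# slope 1 before the changepoint at t=0.5, slope 3 after, no jump at t=0.5
y = piecewise_trend(np.linspace(0, 1, 101), k=1.0, m=0.0,
                    changepoints=np.array([0.5]),
                    delta=np.array([2.0]))
```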
Parameters:
n_changepoints (int, default 25) – The number of potential changepoints. Changepoints are placed
uniformly in the first changepoint_range fraction of data.
changepoint_range (float, default 0.8) – The proportion of the time range where changepoints are allowed.
For example, 0.8 means changepoints only in the first 80% of data.
slope_mean (float, default 0) – The mean of the Normal prior for the slope parameter.
slope_sd (float, default 5) – The standard deviation of the Normal prior for the slope parameter.
intercept_mean (float, default 0) – The mean of the Normal prior for the intercept parameter.
intercept_sd (float, default 5) – The standard deviation of the Normal prior for the intercept parameter.
delta_mean (float, default 0) – The mean of the Laplace prior for the slope changes at changepoints.
delta_sd (float|None, default 0.05) – The scale of the Laplace prior for slope changes. If None, the scale
is learned as a random variable with an Exponential(1.5) prior.
delta_side ({"left","right"}, default "left") – If "left", the slope parameter controls the slope at the earliest
time point. If "right", it controls the slope at the latest time.
pool_type (PoolType, default "complete") –
Type of pooling for multi-series data. One of:
"complete": All series share the same trend parameters
"partial": Hierarchical pooling with shared hyperpriors
"individual": Each series has independent parameters
delta_pool_type (PoolType, default "complete") – Pooling type specifically for changepoint deltas. Only used when
pool_type="partial".
tune_method (TuneMethod|None, default None) –
How the transfer learning is to be performed. One of:
"parametric": Use posterior mean/std as new priors
"prior_from_idata": Use posterior samples directly
None: No transfer learning
delta_tune_method (TuneMethod|None, default None) – Transfer learning method for changepoint deltas.
override_slope_mean_for_tune (np.ndarray|None, default None) – Override the slope mean during transfer learning.
override_slope_sd_for_tune (np.ndarray|None, default None) – Override the slope standard deviation during transfer learning.
override_delta_loc_for_tune (np.ndarray|None, default None) – Override the delta location during transfer learning.
override_delta_scale_for_tune (np.ndarray|None, default None) – Override the delta scale during transfer learning.
shrinkage_strength (float, default 100) – Controls hierarchical shrinkage. Higher values pull individual
series parameters more strongly toward the shared mean.
loss_factor_for_tune (float, default 0) – Regularization factor for transfer learning. Adds a penalty to
keep transferred parameters close to original values.
>>> # With hierarchical pooling for multiple series
>>> model = LinearTrend(
...     pool_type="partial",
...     shrinkage_strength=50,
...     n_changepoints=10
... )
>>> # Transfer learning from a pre-trained model
>>> target_model = LinearTrend(tune_method="parametric")
>>> target_model.fit(short_series, idata=source_trace)
The changepoint formulation follows the Facebook Prophet paper [1]_.
The delta_side="right" option is an extension that allows the
slope parameter to represent the end slope rather than the start slope.
model (TimeSeriesModel) – The model to which the parameters are added.
data (pd.DataFrame) – A pandas dataframe that must at least have columns ds (predictor), y
(target) and series (name of time series).
model_idxs (dict[str, int]) – Count of the number of components from each type.
priors (dict[str, pt.TensorVariable]|None) – A dictionary of multivariate normal random variables approximating the
posterior sample in idata.
idata (az.InferenceData|None) – Sample from a posterior. If it is not None, Vangja will use this to set the
parameters’ priors in the model. If idata is not None, each component from
the model should specify how idata should be used to set its parameters’
priors.
This is the simplest possible trend component: a single intercept
parameter with no slope and no changepoints. It models the baseline
level of the time series as a constant.
The model is:
trend(t)=intercept
This is useful when:
The time series has no discernible upward or downward trend.
You want a minimal trend component that adds only one parameter.
The series is short and estimating a slope would overfit.
Parameters:
intercept_mean (float, default 0) – The mean of the Normal prior for the intercept parameter.
intercept_sd (float, default 5) – The standard deviation of the Normal prior for the intercept.
pool_type (PoolType, default "complete") –
Type of pooling for multi-series data. One of:
"complete": All series share the same intercept.
"partial": Hierarchical pooling with shared hyperpriors.
"individual": Each series has an independent intercept.
>>> # With hierarchical pooling for multiple series
>>> model = FlatTrend(pool_type="partial", shrinkage_strength=50)
>>> # Transfer learning from a pre-trained model
>>> target_model = FlatTrend(tune_method="parametric")
>>> target_model.fit(short_series, idata=source_trace)
FlatTrend is equivalent to LinearTrend(n_changepoints=0) with
the slope fixed to 0, but is more explicit and has fewer parameters
to estimate. When composing models, it serves as a clean baseline
that relies on other components (seasonality, GP, etc.) to explain
temporal variation.
A seasonal component using Fourier series representation.
This component models periodic patterns in time series using a Fourier
series, following the Prophet approach. It allows flexible representation
of seasonal effects with controllable complexity via the number of terms.
The Fourier series representation is based on the Prophet paper [2]_.
Using more Fourier terms allows fitting more complex seasonal patterns
but increases the risk of overfitting.
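Concretely, a Prophet-style seasonal component is a learned linear combination of sine and cosine columns: s(t) = sum over n of a_n * cos(2*pi*n*t/P) + b_n * sin(2*pi*n*t/P). A sketch of the feature construction follows; the function name and signature are illustrative assumptions, not vangja's internals.

```python
import numpy as np

def fourier_features(t, period, series_order):
    """Build 2*series_order Fourier columns; the seasonal effect is X @ beta."""
    t = np.asarray(t, dtype=float)
    n = np.arange(1, series_order + 1)
    angles = 2 * np.pi * np.outer(t, n) / period   # shape (len(t), series_order)
    return np.hstack([np.cos(angles), np.sin(angles)])

# weekly seasonality with 3 Fourier terms -> 6 feature columns
X = fourier_features(np.arange(14), period=7, series_order=3)
```

Each extra term adds one cosine and one sine column, which is why higher series_order fits sharper seasonal shapes at the cost of more parameters.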
Add the FourierSeasonality parameters to the model.
Parameters:
model (TimeSeriesModel) – The model to which the parameters are added.
data (pd.DataFrame) – A pandas dataframe that must at least have columns ds (predictor), y
(target) and series (name of time series).
model_idxs (dict[str, int]) – Count of the number of components from each type.
priors (dict[str, pt.TensorVariable]|None) – A dictionary of multivariate normal random variables approximating the
posterior sample in idata.
idata (az.InferenceData|None) – Sample from a posterior. If it is not None, Vangja will use this to set the
parameters’ priors in the model. If idata is not None, each component from
the model should specify how idata should be used to set its parameters’
priors.
A constant component with a Normal (Gaussian) prior distribution.
This component adds a constant term to the model that is sampled from a
Normal distribution. It’s useful for modeling baseline offsets or intercept
terms that may vary across different time series.
Parameters:
mu (float, default 0) – The mean of the Normal prior for the constant parameter.
sd (float, default 1) – The standard deviation of the Normal prior for the constant parameter.
pool_type (PoolType, default "complete") –
Type of pooling performed when sampling. Options are:
"complete": All series share the same constant value.
"partial": Series have individual constants with shared hyperpriors.
"individual": Each series has a completely independent constant.
tune_method (TuneMethod|None, default None) –
How the transfer learning is to be performed. Options are:
"parametric": Use posterior mean and std from idata as new priors.
"prior_from_idata": Use the posterior samples directly as priors.
None: This component will not be tuned even if idata is provided.
override_mu_for_tune (float|None, default None) – Override the mean of the Normal prior for the constant parameter with
this value during transfer learning.
override_sd_for_tune (float|None, default None) – Override the standard deviation of the Normal prior for the constant
parameter with this value during transfer learning.
shrinkage_strength (float, default 1) – Shrinkage between groups for the hierarchical modeling. Higher values
result in stronger shrinkage toward the shared mean.
>>> from vangja import LinearTrend, NormalConstant
>>> # Add a normal constant offset to a linear trend
>>> model = LinearTrend() + NormalConstant(mu=0, sd=10)
>>> model.fit(data)
>>> predictions = model.predict(horizon=30)
>>> # Use partial pooling for multi-series data
>>> model = LinearTrend() + NormalConstant(mu=0, sd=10, pool_type="partial")
A constant component with a Beta prior distribution scaled to a range.
This component adds a constant term to the model that is sampled from a
Beta distribution and then scaled to lie within [lower, upper]. It’s useful
for modeling parameters that should be bounded and have flexible shapes
controlled by the alpha and beta parameters.
Parameters:
lower (float) – The lower bound for the constant parameter after scaling.
upper (float) – The upper bound for the constant parameter after scaling.
alpha (float, default 0.5) – The alpha parameter of the Beta distribution. Controls the shape.
beta (float, default 0.5) – The beta parameter of the Beta distribution. Controls the shape.
pool_type (PoolType, default "complete") –
Type of pooling performed when sampling. Options are:
"complete": All series share the same constant value.
"partial": Series have individual constants with shared hyperpriors.
"individual": Each series has a completely independent constant.
tune_method (TuneMethod|None, default None) –
How the transfer learning is to be performed. Options are:
"parametric": Use posterior samples to derive new Beta parameters.
"prior_from_idata": Use the posterior samples directly as priors.
None: This component will not be tuned even if idata is provided.
shrinkage_strength (float, default 1) – Shrinkage between groups for the hierarchical modeling. Higher values
result in stronger shrinkage toward the shared mean.
The transformation from Beta to the scaled constant is:
c = beta_value * (upper - lower) + lower
Common choices for alpha and beta:
alpha=beta=0.5: Jeffreys prior (U-shaped, more mass at extremes)
alpha=beta=1: Uniform distribution
alpha=beta=2: Symmetric bell-shaped
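The scaling transformation itself can be illustrated with plain numpy draws, independent of the component's PyMC internals:

```python
import numpy as np

lower, upper = 0.8, 1.2
rng = np.random.default_rng(0)
beta_value = rng.beta(2.0, 2.0, size=1000)   # Beta(2, 2) draws in (0, 1)
c = beta_value * (upper - lower) + lower     # scaled into [lower, upper]
```

Every draw lands inside the bounds, and with alpha=beta=2 the mass concentrates symmetrically around the midpoint of the range.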
Examples
>>> from vangja import LinearTrend, BetaConstant
>>> # Add a beta-distributed scaling factor between 0.8 and 1.2
>>> model = LinearTrend() * BetaConstant(lower=0.8, upper=1.2, alpha=2, beta=2)
>>> model.fit(data)
>>> predictions = model.predict(horizon=30)
>>> # Use partial pooling for multi-series data
>>> model = LinearTrend() * BetaConstant(lower=0.5, upper=1.5,
...                                      pool_type="partial")
A constant component with a Uniform prior distribution.
This component adds a constant term to the model that is sampled from a
Uniform distribution bounded by lower and upper limits. It’s useful for
modeling parameters that should be constrained to a specific range.
Parameters:
lower (float) – The lower bound of the Uniform prior for the constant parameter.
upper (float) – The upper bound of the Uniform prior for the constant parameter.
pool_type (PoolType, default "complete") –
Type of pooling performed when sampling. Options are:
"complete": All series share the same constant value.
"partial": Series have individual constants with shared hyperpriors.
"individual": Each series has a completely independent constant.
tune_method (TuneMethod|None, default None) –
How the transfer learning is to be performed. Options are:
"parametric": Use posterior mean and std from idata to create a
truncated Normal prior.
"prior_from_idata": Use the posterior samples directly as priors.
None: This component will not be tuned even if idata is provided.
shrinkage_strength (float, default 1) – Shrinkage between groups for the hierarchical modeling. Higher values
result in stronger shrinkage toward the shared mean.
Filter predictions to only include dates relevant to a specific series.
When fitting multiple series simultaneously with different date ranges,
the predict() method generates predictions for the entire combined time
range. This function filters predictions to only include dates within a
specific series’ range, which is essential for correct metric calculation
and plotting.
Parameters:
future (pd.DataFrame) – Predictions dataframe from model.predict() containing ‘ds’ and yhat columns.
series_data (pd.DataFrame) – The original data for a specific series (train + test combined, or just
the portion you want to filter to). Must have ‘ds’ column.
yhat_col (str, default "yhat_0") – The name of the prediction column to include in the output.
horizon (int, default 0) – Additional days beyond the series' max date to include (for the forecast period).
Returns:
Filtered predictions with columns [‘ds’, ‘yhat_0’] containing only dates
within the series’ range plus the specified horizon.
Return type:
pd.DataFrame
Examples
>>> # After fitting a multi-series model
>>> future_combined = model.predict(horizon=365)
>>> # Filter to only Air Passengers' relevant dates
>>> future_passengers = filter_predictions_by_series(
...     future_combined,
...     air_passengers,  # full dataset (train + test)
...     yhat_col=f"yhat_{passengers_group}",
...     horizon=365
... )
Calculate evaluation metrics for time series predictions.
Computes Mean Squared Error (MSE), Root Mean Squared Error (RMSE),
Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE)
for each time series in the dataset.
Parameters:
y_true (pd.DataFrame) – A pandas dataframe containing the true values for the inference period
that must at least have columns ds (predictor), y (target) and series
(name of time series).
future (pd.DataFrame) – Pandas dataframe containing the timestamps and predictions. Must have
columns named ‘yhat_{group_code}’ for each group. The ‘ds’ column is
used to match predictions to test data by date.
pool_type (PoolType) – Type of pooling performed when sampling. Used to determine group
assignments in y_true.
Returns:
A dataframe with series names as index and columns for each metric:
‘mse’, ‘rmse’, ‘mae’, ‘mape’.
Predictions are matched to test data by merging on the ‘ds’ column. This
correctly handles cases where predictions are at a different frequency
than the test data (e.g., daily predictions vs monthly test data).
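The merge-then-score logic described in these notes can be sketched as follows for a single series. This is a hypothetical helper, not vangja's implementation.

```python
import numpy as np
import pandas as pd

def basic_metrics(y_true, future, yhat_col="yhat_0"):
    """Merge on 'ds' so predictions and truth align by date, then score."""
    merged = y_true.merge(future[["ds", yhat_col]], on="ds", how="inner")
    err = merged["y"] - merged[yhat_col]
    mse = float(np.mean(err ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / merged["y"]))),
    }
```

Because the join is an inner merge on 'ds', prediction rows without a matching test date are simply dropped, which is what makes mixed-frequency comparisons work.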
Remove random continuous intervals (gaps) from a time series DataFrame.
Creates realistic missing-data scenarios by removing n_gaps
non-overlapping contiguous blocks from the data. Each block removes
approximately gap_fraction of the total data points.
Parameters:
df (pd.DataFrame) – A time series DataFrame. Must have at least a ds column.
n_gaps (int, default 4) – Number of contiguous intervals to remove.
gap_fraction (float, default 0.2) – Fraction of total data points removed per gap.
Returns:
A copy of the input DataFrame with the specified gaps removed,
index reset.
Return type:
pd.DataFrame
Raises:
ValueError – If the total number of points to remove exceeds the length of the
DataFrame.
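A minimal sketch of the gap-removal idea. This is hypothetical code; unlike the documented function, the sketch allows sampled blocks to overlap rather than enforcing non-overlapping gaps.

```python
import numpy as np
import pandas as pd

def remove_gaps(df, n_gaps=2, gap_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(df)
    gap_len = max(1, int(n * gap_fraction))
    if n_gaps * gap_len >= n:
        raise ValueError("gaps would remove the entire series")
    drop = set()
    for _ in range(n_gaps):
        start = int(rng.integers(0, n - gap_len))
        drop.update(range(start, start + gap_len))  # one contiguous block
    return df.drop(df.index[sorted(drop)]).reset_index(drop=True)
```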
Compare multiple fitted models using information criteria.
Wraps arviz.compare to produce a ranked table of models scored by
WAIC or LOO-CV (PSIS).
Parameters:
model_dict (dict[str, az.InferenceData|object]) – Mapping of model names to either arviz.InferenceData objects or
fitted vangja model objects that expose a .trace attribute.
ic ({"loo","waic"}, default "loo") – Information criterion to use.
Returns:
Comparison table sorted by the chosen criterion (best model first).
Calculate the fraction of prior predictive samples within a plausible range.
This is a quantitative complement to visual prior predictive checks.
Because vangja scales the data so that \(y \approx [-1, 1]\) and
\(t \in [0, 1]\), comparing the prior predictive samples against a
fixed plausible window (default [-2,2]) reveals how informative or
diffuse the chosen priors are.
How to interpret the result:
< 5 % — priors are too loose. The sampler wastes time in
physically impossible regions. Reduce the prior standard deviations.
> 95 % — priors may be too tight. The model risks being unable to
capture sudden spikes or changepoints. Increase the prior standard
deviations.
30–60 % — a reasonable sweet spot for flexible models like Prophet.
The prior covers the data range without encouraging absurd values.
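The coverage number itself is straightforward to compute by hand. The sketch below operates on a raw sample array rather than an InferenceData object, purely to illustrate the calculation.

```python
import numpy as np

def coverage(samples, low=-2.0, high=2.0):
    """Fraction of sample values inside the plausible window [low, high]."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean((samples >= low) & (samples <= high)))
```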
Parameters:
prior_predictive (az.InferenceData) – Result of model.sample_prior_predictive().
low (float, default -2.0) – Lower bound of the plausible range (in scaled space).
high (float, default 2.0) – Upper bound of the plausible range (in scaled space).
series_idx (int or None) – If the prior predictive contains multiple series (e.g. from a hierarchical
model), specify which one to calculate coverage for. If None, calculates for everything.
If series_idx is not None you must also pass the corresponding group array
to the group parameter.
group (np.ndarray or None) – If the prior predictive contains multiple groups (e.g. from a hierarchical
model), specify which element belongs to which group.
Returns:
Fraction of individual sample values inside [low,high],
between 0 and 1.
>>> model = LinearTrend() + FourierSeasonality(365.25, 10)
>>> model.fit(data, method="mapx")
>>> ppc = model.sample_prior_predictive(samples=500)
>>> coverage = prior_predictive_coverage(ppc)
>>> print(f"{coverage*100:.1f}% of prior samples are within [-2, 2]")
Plot prior and posterior densities on the same axes.
Generates a grid of subplots, one per parameter, showing the prior
density (from the analytic specification) and the posterior density
(from MCMC/VI samples).
Parameters:
trace (az.InferenceData) – Posterior samples from a fitted model.
prior_params (dict[str, dict[str, float]]) – Mapping of variable names to dicts describing the prior. Each dict
must contain "dist" (one of "normal", "halfnormal",
"laplace") and the relevant parameters ("mu"/"sigma" for
Normal, "sigma" for HalfNormal, "mu"/"b" for Laplace).
var_names (list[str] or None) – Subset of variables to include. Defaults to all keys in
prior_params.
Plot posterior predictive samples, overlaid on observed data.
Parameters:
posterior_predictive (az.InferenceData) – Result of model.sample_posterior_predictive().
series_idx (int or None) – If the posterior predictive contains multiple series (e.g. from a hierarchical
model), specify which one to plot. If None, plots everything.
If series_idx is not None you must also pass the corresponding group array
to the group parameter.
group (np.ndarray or None) – If the posterior predictive contains multiple groups (e.g. from a hierarchical
model), specify which element belongs to which group.
data (pd.DataFrame or None) – Observed data with columns ds and y.
n_samples (int, default 50) – Number of posterior predictive traces to draw.
show_hdi (bool, default False) – If True, shade the Highest Density Interval across time.
hdi_prob (float, default 0.9) – Probability mass for the HDI band (ignored when show_hdi=False).
show_ref_lines (bool, default False) – If True, draw horizontal dashed lines at the scaled-data bounds
given by ref_values.
ref_values (tuple[float, float], default (-1.0, 1.0)) – (lower, upper) values for the reference lines.
t (np.ndarray or None) – x-axis values. When None the observation index is used. Pass
model.data["t"].values for the normalised time axis, or
model.data["ds"].values for calendar dates.
Plot prior predictive samples, optionally overlaid on observed data.
Draws a “spaghetti plot” of prior predictive traces and, optionally,
an HDI envelope and horizontal reference lines to help judge whether
the chosen priors are plausible in the scaled data space.
Parameters:
prior_predictive (az.InferenceData) – Result of model.sample_prior_predictive().
series_idx (int or None) – If the prior predictive contains multiple series (e.g. from a hierarchical
model), specify which one to plot. If None, plots everything.
If series_idx is not None you must also pass the corresponding group array
to the group parameter.
group (np.ndarray or None) – If the prior predictive contains multiple groups (e.g. from a hierarchical
model), specify which element belongs to which group.
data (pd.DataFrame or None) – Observed data with columns ds and y.
n_samples (int, default 50) – Number of prior predictive traces to draw.
ax (matplotlib.axes.Axes or None) – Axes to plot on. Created if None.
show_hdi (bool, default False) – If True, shade the Highest Density Interval across time.
hdi_prob (float, default 0.9) – Probability mass for the HDI band (ignored when show_hdi=False).
show_ref_lines (bool, default False) – If True, draw horizontal dashed lines at the scaled-data bounds
given by ref_values. Useful for checking whether the prior
predictive concentrates within the plausible region of scaled data.
ref_values (tuple[float, float], default (-1.0, 1.0)) – (lower, upper) values for the reference lines (ignored when
show_ref_lines=False). The defaults correspond to the
approximate extent of maxabs-scaled data.
t (np.ndarray or None) – x-axis values. When None the observation index is used. Pass
model.data["t"].values for the normalised time axis, or
model.data["ds"].values for calendar dates.
Return type:
matplotlib.axes.Axes
Examples
>>> model = LinearTrend() + FourierSeasonality(365.25, 10)
>>> model.fit(data, method="mapx")
>>> ppc = model.sample_prior_predictive(samples=200)
>>> # Simple spaghetti plot
>>> plot_prior_predictive(ppc)
>>> # With HDI, reference lines and scaled time axis
>>> plot_prior_predictive(
...     ppc,
...     data=data,
...     show_hdi=True,
...     show_ref_lines=True,
...     t=model.data["t"].values,
... )
The Air Passengers dataset is a classic time series dataset containing
monthly totals of international airline passengers from January 1949 to
December 1960 (144 observations).
This dataset exhibits:
- Clear upward trend
- Strong yearly seasonality
- Multiplicative seasonality (variance increases with level)
Returns:
DataFrame with columns:
- ds: datetime, monthly timestamps from 1949-01 to 1960-12
- y: float, number of passengers (in thousands)
Return type:
pd.DataFrame
Examples
>>> from vangja.datasets import load_air_passengers
>>> df = load_air_passengers()
>>> print(f"Shape: {df.shape}")
Shape: (144, 2)
>>> print(f"Date range: {df['ds'].min()} to {df['ds'].max()}")
Date range: 1949-01-01 to 1960-12-01
Notes
Data is downloaded from the Prophet examples repository on GitHub.
Original source: Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1976)
Time Series Analysis, Forecasting and Control. Third Edition.
This dataset contains daily bike ride counts from Citi Bike station 360
in New York City (2013-07-01 to 2014-10-31). It is used to demonstrate
forecasting short time series with transfer learning.
The dataset exhibits:
Strong weekly seasonality (weekday vs weekend patterns)
Yearly seasonality correlated with temperature/weather
Approximately 3 months of initial data used for training (~106 days)
Returns:
DataFrame with columns:
ds: datetime, daily timestamps from 2013-07-01 to 2014-10-31
This dataset is from Tim Radtke’s blog post “Modeling Short Time Series
with Prior Knowledge”. The vangja library was partially inspired by this
work and Juan Orduz’s PyMC implementation.
Requires the pyreadr package (install with pip install vangja[datasets]).
Load New York City historical daily temperature data.
This dataset contains daily maximum temperatures (Fahrenheit) for
New York City from 2012-10-01 to 2017-11-29. It is used to learn
yearly seasonality patterns that can be transferred to short time series.
This dataset is from Tim Radtke’s blog post “Modeling Short Time Series
with Prior Knowledge”. The temperature seasonality can be used as prior
information for forecasting related short time series (e.g., bike sales).
Load historical hourly temperature data from Kaggle.
Downloads the temperature.csv file from the
Historical Hourly Weather Data
dataset. Returns data for the requested city, filtered to the given
date range and aggregated to the specified frequency.
The raw data contains hourly observations in Kelvin. Values are
converted to Celsius before returning.
Parameters:
city (KaggleTemperatureCity, default "New York") – City column to extract. Must be one of the 36 cities in the
dataset (see KaggleTemperatureCity).
start_date (str, pd.Timestamp, or None, default None) – Start of the date range (inclusive). If None, the earliest
available date is used (~2012-10-01).
end_date (str, pd.Timestamp, or None, default None) – End of the date range (inclusive). If None, the latest
available date is used (~2017-11-30).
freq (str, default "D") – Pandas offset alias for temporal aggregation (e.g. "D" for
daily mean, "W" for weekly mean, "h" for hourly — no
aggregation). The aggregation function is mean.
Returns:
DataFrame with columns:
ds: datetime
y: float, temperature in degrees Celsius
series: str, the original city name from the Kaggle dataset
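The Kelvin-to-Celsius conversion and the mean aggregation to a target frequency described above can be sketched with plain pandas. This is a toy reconstruction on synthetic hourly data, not the loader's actual implementation:

```python
import numpy as np
import pandas as pd

# Synthetic hourly observations in Kelvin, mimicking the raw Kaggle file.
idx = pd.date_range("2013-01-01", periods=48, freq="h")
raw = pd.DataFrame({
    "ds": idx,
    "y": 288.15 + np.random.default_rng(0).normal(0, 1, 48),
})

# Convert Kelvin -> Celsius before returning.
raw["y"] = raw["y"] - 273.15

# Aggregate to the requested frequency with a mean, e.g. freq="D".
daily = raw.set_index("ds")["y"].resample("D").mean().reset_index()
print(daily)  # two daily rows, values near 15 degrees Celsius
```

Passing `"h"` as the frequency would leave the hourly resolution untouched, since each resample bin then contains a single observation.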
Downloads the HomeC.csv file from the
Smart Home Dataset with Weather Information
dataset. Returns data for the requested appliance or total column(s),
filtered to the given date range and aggregated to the specified
frequency.
The raw data has 1-minute resolution and covers roughly
2016-01-01 to 2016-12-16. Each column is in kW.
Parameters:
column (SmartHomeColumn or list[SmartHomeColumn], default "use [kW]") –
The appliance or total column(s) to extract (see
SmartHomeColumn). When a single string is passed the
returned DataFrame has columns ds and y. When a list
is passed the result is in long format with an additional
series column identifying each appliance.
Common choices:
"use[kW]" — total energy use
"gen[kW]" — total energy generation
"Houseoverall[kW]" — house overall consumption
"Dishwasher[kW]", "Fridge[kW]", etc. — individual
appliances
start_date (str, pd.Timestamp, or None, default None) – Start of the date range (inclusive). If None, the earliest
available date is used (~2016-01-01).
end_date (str, pd.Timestamp, or None, default None) – End of the date range (inclusive). If None, the latest
available date is used (~2016-12-16).
freq (str or None, default None) – Pandas offset alias for temporal aggregation (e.g. "D" for
daily mean, "h" for hourly mean, "W" for weekly mean).
The aggregation function is mean. If None, no aggregation
is performed and the original 1-minute data is returned.
Returns:
DataFrame with columns:
ds: datetime
y: float, energy reading in kW
series: str (only when column is a list) — the original column name from the Kaggle dataset
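When a list of columns is requested, the wide per-appliance table is reshaped into the long ds/y/series layout described above. A minimal sketch of that reshape using pandas melt, on toy data standing in for HomeC.csv:

```python
import pandas as pd

# Toy wide frame mimicking two appliance columns in kW at 1-minute resolution.
wide = pd.DataFrame({
    "ds": pd.date_range("2016-01-01", periods=3, freq="min"),
    "Dishwasher [kW]": [0.0, 0.7, 0.7],
    "Fridge [kW]": [0.1, 0.1, 0.1],
})

# Wide -> long: one row per (timestamp, appliance), with the original
# column name preserved in the `series` column.
long = wide.melt(id_vars="ds", var_name="series", value_name="y")
print(long)
```

When a single string is passed instead of a list, no melt is needed and the result keeps only the ds and y columns.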
Creates 5 synthetic time series representing different stores, all sharing
the same date range. Each series has:
- Linear trend with different slopes and intercepts
- Yearly seasonality with different amplitudes
- Weekly seasonality
- Random noise
This dataset is ideal for demonstrating:
- Simultaneous vs sequential fitting
- Individual pooling across multiple series
- Vectorized multi-series forecasting
Parameters:
start_date (str, default "2015-01-01") – Start date for the time series
end_date (str, default "2019-12-31") – End date for the time series
freq (str, default "D") – Frequency of the time series (e.g., "D" for daily)
seed (int or None, default 42) – Random seed for reproducibility. Set to None for random data.
Returns:
df (pd.DataFrame) – Combined DataFrame with columns:
- ds: datetime timestamps
- y: target values
- series: store name (e.g., “store_north”)
params (list of dict) – List of parameter dictionaries for each store, containing:
- name: store name
- trend_slope, trend_intercept: trend parameters
- yearly_amplitude, weekly_amplitude: seasonality amplitudes
- noise_std: noise standard deviation
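A single series of the kind this generator produces (linear trend plus yearly and weekly sinusoids plus Gaussian noise) can be sketched as follows; the slope, amplitudes, and noise level here are illustrative, not the generator's actual per-store parameters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ds = pd.date_range("2015-01-01", "2019-12-31", freq="D")
t = np.arange(len(ds))

trend = 0.05 * t + 100.0                         # linear trend
yearly = 10.0 * np.sin(2 * np.pi * t / 365.25)   # yearly seasonality
weekly = 2.0 * np.sin(2 * np.pi * t / 7)         # weekly seasonality
noise = rng.normal(0.0, 1.5, len(ds))            # random noise

df = pd.DataFrame({
    "ds": ds,
    "y": trend + yearly + weekly + noise,
    "series": "store_north",
})
```

The full generator repeats this with different parameters for each of the 5 stores and concatenates the results into one long DataFrame.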
Load historical stock data split into training and test sets.
Downloads daily OHLCV data for the specified tickers using Yahoo
Finance and computes the typical price as
(Open + High + Low + Close) / 4. The data is split into a
training window and a test horizon around split_date.
Parameters:
tickers (list[str]) – List of ticker symbols to download (e.g., ["AAPL", "MSFT"]).
split_date (str or pd.Timestamp) – The date separating training and test data. Training data
covers [split_date - window_size, split_date) and test
data covers [split_date, split_date + horizon_size].
window_size (int) – Number of calendar days for the training window (before
split_date).
horizon_size (int) – Number of calendar days for the test horizon (from
split_date onwards).
cache_path (Path or None, default None) – Directory for caching downloaded data. Each ticker is stored
as a CSV file. If None, data is downloaded without caching.
If provided, parent directories are created if they do not
exist.
interpolate (bool, default False) – If True, missing days (weekends, holidays) within each series
are filled using linear interpolation after reindexing to a
daily calendar.
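The typical-price computation and the window split around split_date can be sketched with pandas; synthetic OHLC data stands in for the Yahoo Finance download here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ds = pd.date_range("2020-01-01", "2020-12-31", freq="D")
close = 100 + rng.normal(0, 1, len(ds)).cumsum()
ohlc = pd.DataFrame({"ds": ds, "Open": close - 0.5, "High": close + 1.0,
                     "Low": close - 1.0, "Close": close})

# Typical price: mean of the four OHLC values.
ohlc["y"] = (ohlc["Open"] + ohlc["High"] + ohlc["Low"] + ohlc["Close"]) / 4

split_date = pd.Timestamp("2020-10-01")
window_size, horizon_size = 90, 30

# Training covers [split_date - window_size, split_date);
# test covers [split_date, split_date + horizon_size].
train = ohlc[(ohlc["ds"] >= split_date - pd.Timedelta(days=window_size))
             & (ohlc["ds"] < split_date)]
test = ohlc[(ohlc["ds"] >= split_date)
            & (ohlc["ds"] <= split_date + pd.Timedelta(days=horizon_size))]
```

Note that both window sizes count calendar days, so with real exchange data the number of rows in each window is smaller than the day count unless interpolate=True fills the non-trading days.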
Get the tickers that were consistently in the S&P 500 during a date range.
Returns tickers that were part of the S&P 500 for the entire duration
between start_date and end_date. A ticker is excluded if it
was removed at any point during the range, even if it was later
re-added.
Parameters:
start_date (str, datetime, or pd.Timestamp) – Start of the date range (inclusive).
end_date (str, datetime, or pd.Timestamp) – End of the date range (inclusive).
cache_path (Path or None, default None) – Directory for caching Wikipedia data as CSV files. If None,
data is fetched without caching. If provided, parent directories
are created if they do not exist.
Returns:
Sorted list of ticker symbols that were consistently in the
S&P 500 during the entire date range.
Accuracy depends on Wikipedia’s “List of S&P 500 companies”
historical changes table, which has comprehensive data from
approximately 1997 onwards. Results for earlier periods may be
less accurate.
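The "consistently in the index" filter amounts to dropping any ticker with a removal event inside the range, even if it was later re-added. A toy sketch of that logic, assuming a membership-change log with ticker/action/date columns (the real function parses Wikipedia's historical-changes table, whose layout differs):

```python
import pandas as pd

# Toy membership-change log standing in for the Wikipedia changes table.
changes = pd.DataFrame({
    "ticker": ["AAA", "BBB", "BBB", "CCC"],
    "action": ["added", "removed", "added", "removed"],
    "date": pd.to_datetime(
        ["1995-01-01", "2016-05-01", "2017-02-01", "2021-06-01"]),
})
current = {"AAA", "BBB"}  # members as of end_date

start, end = pd.Timestamp("2015-01-01"), pd.Timestamp("2019-12-31")

# Exclude any ticker removed at any point inside the range,
# even if it was later re-added (as BBB was in 2017).
removed_in_range = set(
    changes.loc[(changes["action"] == "removed")
                & changes["date"].between(start, end), "ticker"]
)
consistent = sorted(current - removed_in_range)
print(consistent)  # ['AAA']
```

BBB is excluded because of its 2016 removal despite being re-added before the end of the range; CCC's removal falls outside the range and is ignored.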