GAMs
GAM model types from pymgcv.gam.
AbstractGAM
Abstract base class for GAM models.
This class cannot be initialized but provides a common interface for fitting and predicting GAM models.
coefficients
Extract model coefficients from the fitted GAM.
Returns a series whose index is the mgcv-style name of each parameter.
covariance
covariance(
*, sandwich: bool = False, freq: bool = False, unconditional: bool = False
) -> DataFrame
Extract the covariance matrix from the fitted GAM.
Extracts the Bayesian posterior covariance matrix of the parameters or frequentist covariance matrix of the parameter estimators from the fitted GAM.
Parameters:
-
sandwich
(bool
, default:False
) –If True, compute sandwich estimate of covariance matrix. Currently expensive for discrete bam fits.
-
freq
(bool
, default:False
) –If True, return the frequentist covariance matrix of the parameter estimators. If False, return the Bayesian posterior covariance matrix of the parameters. The latter option includes the expected squared bias according to the Bayesian smoothing prior.
-
unconditional
(bool
, default:False
) –If True (and freq=False), return the Bayesian smoothing parameter uncertainty corrected covariance matrix, if available.
Returns:
-
DataFrame
–The covariance matrix as a pandas DataFrame whose column names and index are the mgcv-style parameter names.
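As a rough illustration of how the returned matrix might be used, the sketch below takes the square roots of the diagonal to obtain per-coefficient standard errors. It assumes the matrix has been converted to a nested dict of the shape produced by pandas `DataFrame.to_dict()`; the helper name, parameter names and values are made up for illustration and are not part of pymgcv.

```python
import math


def standard_errors(cov: dict[str, dict[str, float]]) -> dict[str, float]:
    """Return the square root of each diagonal entry of a covariance matrix.

    `cov` is a nested dict {column: {row: value}}, e.g. from `cov_df.to_dict()`.
    """
    return {name: math.sqrt(cov[name][name]) for name in cov}


# Hypothetical 2x2 covariance matrix with mgcv-style parameter names.
example_cov = {
    "(Intercept)": {"(Intercept)": 0.04, "s(x).1": 0.01},
    "s(x).1": {"(Intercept)": 0.01, "s(x).1": 0.25},
}
example_se = standard_errors(example_cov)
print(example_se)
```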
partial_effect
partial_effect(
target: str, term: TermLike, data: DataFrame, *, compute_se: bool = False
) -> PredictionResult
Compute the partial effect for a single model term.
This method efficiently computes the contribution of one specific term to the model predictions.
Parameters:
-
target
(str
) –Name of the target variable (response variable or family parameter name from the model specification)
-
term
(TermLike
) –The specific term to evaluate (must match a term used in the original model specification)
-
data
(DataFrame
) –DataFrame containing the predictor variables needed for the term
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors
partial_residuals
Compute partial residuals for model diagnostic plots.
Partial residuals combine the fitted values from a specific term with the overall model residuals. They're useful for assessing whether the chosen smooth function adequately captures the relationship, or if a different functional form might be more appropriate.
Parameters:
-
target
(str
) –Name of the response variable.
-
term
(TermLike
) –The model term to compute partial residuals for.
-
data
(DataFrame
) –DataFrame containing the data (must include the response variable).
-
**kwargs
(Any
) –Additional keyword arguments to pass to partial_effects.
Returns:
-
Series
–Series containing the partial residuals for the specified term
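The idea can be sketched in a few lines. This is a minimal illustration that assumes an identity link and raw (response minus fitted) residuals; the residual type pymgcv actually uses may differ, and the function name and numbers are hypothetical.

```python
def partial_residuals_sketch(term_effect, fitted, response):
    """Add the overall residuals (response - fitted) to one term's effect."""
    return [
        effect + (y - y_hat)
        for effect, y_hat, y in zip(term_effect, fitted, response)
    ]


# Hypothetical values for three observations.
term_effect = [0.5, -0.25, 1.0]  # contribution of the term of interest
fitted = [2.0, 1.0, 3.0]         # full model prediction
response = [2.5, 0.5, 3.0]       # observed response
example_partial = partial_residuals_sketch(term_effect, fitted, response)
print(example_partial)
```

Plotting such values against the term's covariate, alongside the estimated smooth, is one way to see whether systematic structure remains.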
GAM
GAM(
predictors: dict[str, Iterable[TermLike] | TermLike],
family_predictors: dict[str, Iterable[TermLike] | TermLike] | None = None,
*,
family: str = "gaussian",
add_intercepts: bool = True,
)
Standard GAM Model.
Initialize a GAM/BAM model.
Parameters:
-
predictors
(dict[str, Iterable[TermLike] | TermLike]
) –Dictionary mapping response variable names to an iterable of TermLike objects used to predict \(g(\mathbb{E}[Y])\). For single response models, use a single key-value pair. For multivariate models, include multiple response variables.
-
family_predictors
(dict[str, Iterable[TermLike] | TermLike] | None
, default:None
) –Dictionary mapping family parameter names to an iterable of terms for modeling those parameters. Keys are used as labels during prediction and should match the order expected by the mgcv family.
-
family
(str
, default:'gaussian'
) –String specifying the mgcv family for the error distribution. This is passed directly to R's mgcv and can include family arguments.
-
add_intercepts
(bool
, default:True
) –If True, adds an intercept term to each formula. If False, we assume that any Intercept terms desired are manually added to the formulae.
fit
fit(
data: DataFrame,
*,
method: Literal[
"GCV.Cp", "GACV.Cp", "QNCV", "REML", "P-REML", "ML", "P-ML", "NCV"
] = "GCV.Cp",
weights: str | ndarray | Series | None = None,
optimizer: str | tuple[str, str] = ("outer", "newton"),
scale: Union[Literal["unknown"], float, int, NoneType] = None,
select: bool = False,
gamma: float | int = 1,
n_threads: int = 1,
) -> typing.Self
Fit the GAM.
Parameters:
-
data
(DataFrame
) –DataFrame containing all variables referenced in the specification. Variable names must match those used in the model terms.
-
method
(Literal['GCV.Cp', 'GACV.Cp', 'QNCV', 'REML', 'P-REML', 'ML', 'P-ML', 'NCV']
, default:'GCV.Cp'
) –Method for smoothing parameter estimation, matching the mgcv options.
-
weights
(str | ndarray | Series | None
, default:None
) –Observation weights. Either a string matching a column name, or an array/series with length equal to the number of observations.
-
optimizer
(str | tuple[str, str]
, default:('outer', 'newton')
) –A string or length-2 tuple specifying the numerical optimization method used to optimize the smoothing parameter estimation criterion (given by method). "outer" selects the direct nested optimization approach and can use several alternative optimizers, specified in the second element: "newton" (default), "bfgs", "optim" or "nlm". "efs" selects the extended Fellner-Schall method of Wood and Fasiolo (2017).
-
scale
(Union[Literal['unknown'], float, int, NoneType]
, default:None
) –If a number is provided, it is treated as a known scale parameter. If left as None, the scale parameter is 1 for Poisson and binomial and unknown otherwise. Note that (RE)ML methods can only work with scale parameter 1 for the Poisson and binomial cases.
-
select
(bool
, default:False
) –If set to True then gam can add an extra penalty to each term so that it can be penalized to zero. This means that the smoothing parameter estimation during fitting can completely remove terms from the model. If the corresponding smoothing parameter is estimated as zero then the extra penalty has no effect. Use gamma to increase level of penalization.
-
gamma
(float | int
, default:1
) –Increase this beyond 1 to produce smoother models. gamma multiplies the effective degrees of freedom in the GCV or UBRE/AIC. gamma can be viewed as an effective sample size in the GCV score, and this also enables it to be used with REML/ML. Ignored with P-RE/ML or the efs optimizer.
-
n_threads
(int
, default:1
) –Number of threads to use for fitting the GAM.
predict
predict(
data: DataFrame | None = None,
*,
compute_se: bool = False,
block_size: int | None = None,
) -> dict[str, pymgcv.custom_types.PredictionResult]
Compute model predictions with uncertainty estimates.
Makes predictions for new data using the fitted GAM model. Predictions are returned on the link scale (linear predictor scale), not the response scale. For response scale predictions, apply the appropriate inverse link function to the results.
Parameters:
-
data
(DataFrame | None
, default:None
) –DataFrame containing predictor variables. Must include all variables referenced in the original model specification.
-
compute_se
(bool
, default:False
) –Whether to compute standard errors for predictions.
-
block_size
(int | None
, default:None
) –Number of rows to process at a time. If None then block size is 1000 if data supplied, and the number of rows in the model frame otherwise.
Returns:
-
dict[str, PredictionResult]
–A dictionary mapping the target variable names to a PredictionResult containing the predictions, and the standard errors if compute_se is True.
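Because predictions are returned on the link scale, a small helper along these lines can map them back to the response scale. It is a sketch that assumes a log link (so the inverse link is exp) and an approximate normal interval formed on the link scale; the function name and the z value of 1.96 are illustrative choices, not part of pymgcv.

```python
import math
from typing import Callable, Sequence


def to_response_scale(
    fit: Sequence[float],
    se: Sequence[float],
    inverse_link: Callable[[float], float] = math.exp,
    z: float = 1.96,
) -> tuple[list[float], list[float], list[float]]:
    """Map link-scale predictions and standard errors to the response scale.

    The interval is built on the link scale and then transformed, so
    lower <= mean <= upper holds for an increasing inverse link.
    """
    mean = [inverse_link(f) for f in fit]
    lower = [inverse_link(f - z * s) for f, s in zip(fit, se)]
    upper = [inverse_link(f + z * s) for f, s in zip(fit, se)]
    return mean, lower, upper


# Hypothetical link-scale values, standing in for the fit and se arrays.
mean, lower, upper = to_response_scale(fit=[0.0, 1.0], se=[0.1, 0.2])
print(mean, lower, upper)
```

With a fitted model, the fit and se attributes of the PredictionResult for a given target (requested with compute_se=True) would take the place of the hypothetical lists.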
partial_effects
partial_effects(
data: DataFrame | None = None,
*,
compute_se: bool = False,
block_size: int | None = None,
) -> dict[str, pymgcv.custom_types.PartialEffectsResult]
Compute partial effects for all model terms.
Calculates the contribution of each model term to the overall prediction on the link scale. The sum of all fit columns equals the total prediction (link scale).
Parameters:
-
data
(DataFrame | None
, default:None
) –DataFrame containing predictor variables for evaluation. Defaults to using the data for fitting.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors.
-
block_size
(int | None
, default:None
) –Number of rows to process at a time. If None then block size is 1000 if data supplied, and the number of rows in the model frame otherwise.
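The additivity property can be checked directly. The sketch below uses a plain dict of per-term contributions as a stand-in for the fit columns of one target; the helper name, term labels and numbers are invented for illustration.

```python
def total_from_partial_effects(effects: dict[str, list[float]]) -> list[float]:
    """Sum per-term contributions row by row to recover the link-scale prediction."""
    return [sum(row) for row in zip(*effects.values())]


# Hypothetical partial effects for three observations.
example_effects = {
    "Intercept": [1.0, 1.0, 1.0],
    "s(x0)": [0.5, -0.5, 0.25],
    "s(x1)": [0.25, 0.5, -0.25],
}
example_total = total_from_partial_effects(example_effects)
print(example_total)
```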
BAM
BAM(
predictors: dict[str, Iterable[TermLike] | TermLike],
family_predictors: dict[str, Iterable[TermLike] | TermLike] | None = None,
*,
family: str = "gaussian",
add_intercepts: bool = True,
)
A big-data GAM (BAM) model.
Initialize a GAM/BAM model.
Parameters:
-
predictors
(dict[str, Iterable[TermLike] | TermLike]
) –Dictionary mapping response variable names to an iterable of TermLike objects used to predict \(g(\mathbb{E}[Y])\). For single response models, use a single key-value pair. For multivariate models, include multiple response variables.
-
family_predictors
(dict[str, Iterable[TermLike] | TermLike] | None
, default:None
) –Dictionary mapping family parameter names to an iterable of terms for modeling those parameters. Keys are used as labels during prediction and should match the order expected by the mgcv family.
-
family
(str
, default:'gaussian'
) –String specifying the mgcv family for the error distribution. This is passed directly to R's mgcv and can include family arguments.
-
add_intercepts
(bool
, default:True
) –If True, adds an intercept term to each formula. If False, we assume that any Intercept terms desired are manually added to the formulae.
fit
fit(
data: DataFrame,
*,
method: Literal[
"fREML", "GCV.Cp", "GACV.Cp", "REML", "P-REML", "ML", "P-ML", "NCV"
] = "fREML",
weights: str | ndarray | Series | None = None,
scale: Union[Literal["unknown"], float, int, NoneType] = None,
select: bool = False,
gamma: float | int = 1,
chunk_size: int = 10000,
discrete: bool = False,
samfrac: float | int = 1,
gc_level: Literal[0, 1, 2] = 0,
) -> typing.Self
Fit the GAM.
Parameters:
-
data
(DataFrame
) –DataFrame containing all variables referenced in the specification. Variable names must match those used in the model terms.
-
method
(Literal['fREML', 'GCV.Cp', 'GACV.Cp', 'REML', 'P-REML', 'ML', 'P-ML', 'NCV']
, default:'fREML'
) –Method for smoothing parameter estimation, matching the mgcv options.
-
weights
(str | ndarray | Series | None
, default:None
) –Observation weights. Either a string matching a column name, or an array/series with length equal to the number of observations.
-
scale
(Union[Literal['unknown'], float, int, NoneType]
, default:None
) –If a number is provided, it is treated as a known scale parameter. If left as None, the scale parameter is 1 for Poisson and binomial and unknown otherwise. Note that (RE)ML methods can only work with scale parameter 1 for the Poisson and binomial cases.
-
select
(bool
, default:False
) –If set to True then gam can add an extra penalty to each term so that it can be penalized to zero. This means that the smoothing parameter estimation during fitting can completely remove terms from the model. If the corresponding smoothing parameter is estimated as zero then the extra penalty has no effect. Use gamma to increase level of penalization.
-
gamma
(float | int
, default:1
) –Increase this beyond 1 to produce smoother models. gamma multiplies the effective degrees of freedom in the GCV or UBRE/AIC. gamma can be viewed as an effective sample size in the GCV score, and this also enables it to be used with REML/ML. Ignored with P-RE/ML or the efs optimizer.
-
chunk_size
(int
, default:10000
) –The model matrix is created in chunks of this size, rather than ever being formed whole. Reset to 4p if chunk_size < 4p, where p is the number of coefficients.
-
discrete
(bool
, default:False
) –If True and using method="fREML", discretizes covariates for storage and efficiency reasons.
-
samfrac
(float | int
, default:1
) –If 0<samfrac<1, performs a fast preliminary fitting step using a subsample of the data to improve convergence speed.
-
gc_level
(Literal[0, 1, 2]
, default:0
) –0 uses R's garbage collector, 1 and 2 use progressively more frequent garbage collection, which takes time but reduces memory requirements.
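The chunk size rule can be made concrete with a short sketch. It illustrates the behavior described above (a floor of 4p rows per chunk) rather than mgcv's actual implementation, and the helper names and numbers are hypothetical.

```python
from collections.abc import Iterator


def effective_chunk_size(chunk_size: int, n_coef: int) -> int:
    """Apply the documented floor: chunks hold at least 4 * n_coef rows."""
    return max(chunk_size, 4 * n_coef)


def chunk_ranges(
    n_rows: int, chunk_size: int, n_coef: int
) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, stop) row ranges covering n_rows in chunks."""
    size = effective_chunk_size(chunk_size, n_coef)
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)


# 25,000 rows with the default chunk size and a model with 30 coefficients.
example_ranges = list(chunk_ranges(n_rows=25_000, chunk_size=10_000, n_coef=30))
print(example_ranges)
```

Larger chunks mean fewer passes but a larger block of the model matrix held in memory at once.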
predict
predict(
data: DataFrame | None = None,
*,
compute_se: bool = False,
block_size: int = 50000,
discrete: bool = True,
n_threads: int = 1,
gc_level: Literal[0, 1, 2] = 0,
) -> dict[str, pymgcv.custom_types.PredictionResult]
Compute model predictions with uncertainty estimates.
Makes predictions for new data using the fitted GAM model. Predictions are returned on the link scale (linear predictor scale), not the response scale. For response scale predictions, apply the appropriate inverse link function to the results.
Parameters:
-
data
(DataFrame | None
, default:None
) –DataFrame containing predictor variables. Must include all variables referenced in the original model specification.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors.
-
block_size
(int
, default:50000
) –Number of rows to process at a time.
-
n_threads
(int
, default:1
) –Number of threads to use for computation.
-
discrete
(bool
, default:True
) –If True and the model was fitted with discrete=True, then uses discrete prediction methods in which covariates are discretized for storage and efficiency reasons.
-
gc_level
(Literal[0, 1, 2]
, default:0
) –0 uses R's garbage collector, 1 and 2 use progressively more frequent garbage collection, which takes time but reduces memory requirements.
partial_effects
partial_effects(
data: DataFrame | None = None,
*,
compute_se: bool = False,
block_size: int = 50000,
n_threads: int = 1,
discrete: bool = True,
gc_level: Literal[0, 1, 2] = 0,
) -> dict[str, pymgcv.custom_types.PartialEffectsResult]
Compute partial effects for all model terms.
Calculates the contribution of each model term to the overall prediction. This decomposition is useful for understanding which terms contribute most to predictions and for creating partial effect plots. The sum of all fit columns equals the total prediction.
Parameters:
-
data
(DataFrame | None
, default:None
) –DataFrame containing predictor variables for evaluation.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors.
-
block_size
(int
, default:50000
) –Number of rows to process at a time. Higher is faster but more memory intensive.
-
n_threads
(int
, default:1
) –Number of threads to use for computation.
-
discrete
(bool
, default:True
) –If True and the model was fitted with discrete=True, then uses discrete prediction methods in which covariates are discretized for storage and efficiency reasons.
-
gc_level
(Literal[0, 1, 2]
, default:0
) –0 uses R's garbage collector, 1 and 2 use progressively more frequent garbage collection, which takes time but reduces memory requirements.
FitState
The mgcv gam, and the data used for fitting.
This gets set as an attribute fit_state on the AbstractGAM object after fitting.
Attributes:
-
rgam
–The fitted mgcv gam object.
-
data
–The data used for fitting.
PredictionResult
Container for predictions or individual partial effects with optional standard errors.
Used for predictions or the partial effect of a single variable.
Attributes:
-
fit
–Predicted values or partial effect as a NumPy array.
-
se
–Standard errors of the predictions, if available.