preprocessing

`boxcox(method='mle')`

Applies the Box-Cox transformation to numeric columns in a panel DataFrame.

Parameters:

Name	Type	Description	Default
`method`	`str`	The method used to determine the lambda parameter of the Box-Cox transformation. Supported methods: `mle` (maximum likelihood estimation) and `pearsonr` (Pearson correlation coefficient).	`'mle'`
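
The transformation itself, for a fixed lambda, can be sketched in plain Python as follows. This is an illustration of the standard Box-Cox formula only: the helper name `box_cox` is hypothetical, and the sketch does not estimate lambda the way `method` does.

```python
import math


def box_cox(values, lmbda):
    """Apply the Box-Cox transform to strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lmbda == 0:
        # Limiting case of the formula as lambda approaches zero.
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1.0) / lmbda for v in values]
```

With `lmbda=1` the series is only shifted by one; with `lmbda=0` it is log-transformed.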

`coerce_dtypes(schema)`

Coerces the column datatypes of a DataFrame using the provided schema.

Parameters:

Name	Type	Description	Default
`schema`	`Mapping[str, DataType]`	A dictionary-like object mapping column names to the desired data types.	required

`deseasonalize_fourier(sp, K, robust=False)`

Removes seasonality via residualized regression with Fourier terms.

Parameters:

Name	Type	Description	Default
`sp`	`int`	Seasonal period.	required
`K`	`int`	Maximum order(s) of Fourier terms. Must be less than `sp`.	required
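
As a rough illustration of what "Fourier terms" means here, the sketch below builds the sine and cosine regressors for orders 1 to `K`. The helper name `fourier_terms` and the column ordering are assumptions for illustration; the regression step that produces the residuals is not shown.

```python
import math


def fourier_terms(n_obs, sp, K):
    """Return n_obs rows of [sin_1, cos_1, ..., sin_K, cos_K] for seasonal period sp."""
    rows = []
    for t in range(n_obs):
        row = []
        for k in range(1, K + 1):
            angle = 2.0 * math.pi * k * t / sp
            row.extend([math.sin(angle), math.cos(angle)])
        rows.append(row)
    return rows
```

Each row repeats every `sp` observations, so regressing a series on these columns and keeping the residuals removes a smooth seasonal pattern of that period.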

`detrend(freq, method='linear')`

Removes mean or linear trend from numeric columns in a panel DataFrame.

Parameters:

Name	Type	Description	Default
`freq`	`str`	Offset alias supported by Polars.	required
`method`	`str`	If `mean`, subtracts mean from each time-series. If `linear`, subtracts line of best-fit (via OLS) from each time-series. Defaults to `linear`.	`'linear'`
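
The two methods can be sketched for a single time series in plain Python. The helper name `detrend_series` is hypothetical, and the sketch assumes evenly spaced observations indexed 0, 1, 2, ...; it is not the library's implementation.

```python
def detrend_series(values, method="linear"):
    """Subtract the mean or the OLS line of best fit from one time series."""
    n = len(values)
    mean_y = sum(values) / n
    if method == "mean":
        return [y - mean_y for y in values]
    if method != "linear":
        raise ValueError(f"unsupported method: {method!r}")
    # Closed-form simple linear regression of y on the time index t.
    mean_t = (n - 1) / 2.0
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(values)]
```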

`diff(order, sp=1, fill_strategy=None)`

Differences each time series in panel data by the given order and seasonal period.

Parameters:

Name	Type	Description	Default
`order`	`int`	The order to difference.	required
`sp`	`int`	Seasonal periodicity.	`1`
`fill_strategy`	`Optional[str]`	Strategy to fill nulls by. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"].	`None`
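
One common reading of `order` and `sp` is sketched below for a single series: the lag-`sp` difference is applied `order` times, and positions without enough history stay null. The helper name `difference` is hypothetical, and the exact null handling is an assumption.

```python
def difference(values, order, sp=1):
    """Apply the lag-sp difference `order` times; unavailable positions become None."""
    out = list(values)
    for _ in range(order):
        out = [
            None
            if t < sp or out[t] is None or out[t - sp] is None
            else out[t] - out[t - sp]
            for t in range(len(out))
        ]
    return out
```

For example, `difference([1, 2, 4, 7, 11], order=2)` leaves two leading nulls followed by the second differences, which is where a `fill_strategy` would come in.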

`fractional_diff(d, min_weight=None, window_size=None)`

Compute the fractional differential of a time series.

This technique is described in Advances in Financial Machine Learning by Marcos Lopez de Prado (2018).

For feature creation, it is suggested to use the smallest value of d that makes the time series stationary. This can be found by running the Augmented Dickey-Fuller test on the differenced series for a range of d values and selecting the smallest d for which the test indicates stationarity.

Parameters:

Name	Type	Description	Default
`d`	`float`	The fractional order of the differencing operator.	required
`min_weight`	`float`	The minimum weight to use for calculations. If specified, the window size is computed from this value, so `window_size` is not needed.	`None`
`window_size`	`int`	The window size of the fractional differencing operator. If specified, `min_weight` is not needed.	`None`
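
The relationship between `d`, `min_weight`, and `window_size` can be sketched with the usual fixed-width-window formulation, where the weights follow the recursion w_0 = 1, w_k = -w_{k-1} (d - k + 1) / k. The helper names below are hypothetical, and requiring exactly one of the two options is an assumption of this sketch.

```python
def frac_diff_weights(d, window_size=None, min_weight=None):
    """Weights of the fractional differencing operator, truncated by size or by weight."""
    if (window_size is None) == (min_weight is None):
        raise ValueError("specify exactly one of window_size or min_weight")
    if min_weight is not None and min_weight <= 0:
        raise ValueError("min_weight must be positive")
    weights = [1.0]
    k = 1
    while window_size is None or len(weights) < window_size:
        w = -weights[-1] * (d - k + 1) / k
        if min_weight is not None and abs(w) < min_weight:
            break  # the window size is implied by the first weight below the cutoff
        weights.append(w)
        k += 1
    return weights


def frac_diff(values, d, window_size=None, min_weight=None):
    """Fractionally difference one series; positions without a full window are None."""
    w = frac_diff_weights(d, window_size, min_weight)
    width = len(w)
    return [
        None if t < width - 1 else sum(w[k] * values[t - k] for k in range(width))
        for t in range(len(values))
    ]
```

With `d=1` and `window_size=2` the weights are `[1.0, -1.0]`, so the result reduces to ordinary first differences.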

`impute(method)`

Performs missing value imputation on numeric columns of a DataFrame grouped by entity.

Parameters:

Name	Type	Description	Default
`method`	`Union[str, int, float]`	The imputation method to use. Supported methods are: 'mean' (replace missing values with the mean of the corresponding column); 'median' (replace missing values with the median of the corresponding column); 'fill' (replace missing values with the mean for float columns and the median for integer columns); 'ffill' (forward fill missing values); 'bfill' (backward fill missing values); 'interpolate' (interpolate missing values using linear interpolation); or an int or float (replace missing values with the specified constant).	required
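
A few of these methods are sketched below for one entity's values, with `None` standing in for a missing value. The helper name `impute_series` is hypothetical and only covers 'mean', 'ffill', and constant fills.

```python
def impute_series(values, method):
    """Fill None entries using a constant, the observed mean, or a forward fill."""
    if isinstance(method, (int, float)):
        return [method if v is None else v for v in values]
    if method == "mean":
        observed = [v for v in values if v is not None]
        fill = sum(observed) / len(observed)
        return [fill if v is None else v for v in values]
    if method == "ffill":
        out, last = [], None
        for v in values:
            last = v if v is not None else last
            out.append(last)  # leading gaps stay None: nothing to carry forward yet
        return out
    raise ValueError(f"unsupported method in this sketch: {method!r}")
```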

`lag(lags, is_sorted=False)`

Applies lag transformation to a LazyFrame. The time series is assumed to have no null values.

Parameters:

Name	Type	Description	Default
`lags`	`List[int]`	A list of lag values to apply.	required
`is_sorted`	`bool`	If the data is already sorted by the entity and time columns, set this to True to skip the sort and save some time.	`False`
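
For a single, already sorted series, lagging amounts to shifting the values down by each lag. The helper name `add_lags` and the `lag_k` column names are illustrative assumptions, not the library's naming scheme.

```python
def add_lags(values, lags):
    """Return one shifted copy of `values` per positive lag, padded with None."""
    n = len(values)
    return {
        f"lag_{k}": [None] * min(k, n) + list(values[: max(n - k, 0)])
        for k in lags
    }
```

The leading `None` entries are why lagged features are usually followed by dropping or filling the first `max(lags)` rows of each entity.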

`one_hot_encode(drop_first=False)`

Encode categorical features as a one-hot numeric array.

Parameters:

Name	Type	Description	Default
`drop_first`	`bool`	Drop the first one hot feature.	`False`

Raises:

Type	Description
`ValueError`	If X passed into `transform_new` contains unknown categories.
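
The effect of `drop_first` and the unknown-category error can be sketched for one categorical column. The helper name `one_hot` and the explicit `categories` argument are assumptions made for illustration.

```python
def one_hot(values, categories, drop_first=False):
    """Encode `values` as 0/1 columns, one per known category."""
    unknown = set(values) - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    # Dropping the first category avoids perfectly collinear columns.
    kept = categories[1:] if drop_first else categories
    return {c: [int(v == c) for v in values] for c in kept}
```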

`reindex(drop_duplicates=False)`

Reindexes the entity and time columns to have every possible combination of (entity, time).

Parameters:

Name	Type	Description	Default
`drop_duplicates`	`bool`	Defaults to False. If True, duplicates are dropped before reindexing.	`False`
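
"Every possible combination" can be pictured as the cross product of the distinct entities and distinct timestamps, with missing pairs filled by nulls. The sketch below uses plain tuples; the helper name `reindex_rows` is hypothetical and duplicate handling is left out.

```python
from itertools import product


def reindex_rows(rows):
    """rows: list of (entity, time, value). Return all (entity, time) pairs, None where absent."""
    lookup = {(e, t): v for e, t, v in rows}
    entities = sorted({e for e, _, _ in rows})
    times = sorted({t for _, t, _ in rows})
    return [(e, t, lookup.get((e, t))) for e, t in product(entities, times)]
```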

`resample(freq, agg_method, impute_method)`

Resamples and transforms a DataFrame using the specified frequency, aggregation method, and imputation method.

Parameters:

Name	Type	Description	Default
`freq`	`str`	Offset alias supported by Polars.	required
`agg_method`	`str`	The aggregation method to use for resampling. Supported values are 'sum', 'mean', and 'median'.	required
`impute_method`	`Union[str, int, float]`	The method used for imputing missing values. If a string, supported values are 'ffill' (forward fill) and 'bfill' (backward fill). If an int or float, missing values will be filled with the provided value.	required

`roll(window_sizes, stats, freq, fill_strategy=None)`

Performs rolling window calculations on specified columns of a DataFrame.

Parameters:

Name	Type	Description	Default
`window_sizes`	`List[int]`	A list of integers representing the window sizes for the rolling calculations.	required
`stats`	`List[Literal['mean', 'min', 'max', 'mlm', 'sum', 'std', 'cv']]`	A list of statistical measures to calculate for each rolling window. Supported values are: 'mean' (mean), 'min' (minimum), 'max' (maximum), 'mlm' (maximum minus minimum), 'sum' (sum), 'std' (standard deviation), and 'cv' (coefficient of variation).	required
`freq`	`str`	Offset alias supported by Polars.	required
`fill_strategy`	`Optional[str]`	Strategy to fill nulls by. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"].	`None`
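
The statistics above can be sketched for one series and one window size. This is an illustration only: the helper name `rolling_stat` is hypothetical, the window is assumed to end at the current observation, and 'std' is assumed to be the sample standard deviation (so 'std' and 'cv' need a window of at least two).

```python
import statistics


def rolling_stat(values, window_size, stat):
    """Compute one rolling statistic; positions without a full window are None."""
    funcs = {
        "mean": statistics.fmean,
        "min": min,
        "max": max,
        "mlm": lambda w: max(w) - min(w),  # maximum minus minimum
        "sum": sum,
        "std": statistics.stdev,
        "cv": lambda w: statistics.stdev(w) / statistics.fmean(w),
    }
    func = funcs[stat]
    return [
        None if t + 1 < window_size else func(values[t + 1 - window_size : t + 1])
        for t in range(len(values))
    ]
```

The leading `None` entries are the nulls that `fill_strategy` would fill.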

`scale(use_mean=True, use_std=True, rescale_bool=False)`

Performs scaling and rescaling operations on the numeric columns of a DataFrame.

Parameters:

Name	Type	Description	Default
`use_mean`	`bool`	Whether to subtract the mean from the numeric columns. Defaults to True.	`True`
`use_std`	`bool`	Whether to divide the numeric columns by the standard deviation. Defaults to True.	`True`
`rescale_bool`	`bool`	Whether to rescale boolean columns to the range [-1, 1]. Defaults to False.	`False`
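
How the three flags interact can be sketched for one numeric series and one boolean series. The helper names are hypothetical, the sample standard deviation is an assumption, and mapping False to -1 and True to 1 is one natural reading of "the range [-1, 1]".

```python
import statistics


def scale_series(values, use_mean=True, use_std=True):
    """Optionally centre by the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values) if use_mean else 0.0
    std = statistics.stdev(values) if use_std else 1.0
    return [(v - mean) / std for v in values]


def rescale_bools(flags):
    """Map False to -1 and True to 1."""
    return [2 * int(flag) - 1 for flag in flags]
```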

`time_to_arange(eager=False)`

Coerces time column into arange per entity.

Assumes even-spaced time-series and homogeneous start dates.

`trim(direction='both')`

Trims time-series in panel to have the same start or end dates as the shortest time-series.

Parameters:

Name	Type	Description	Default
`direction`	`Literal['both', 'left', 'right']`	Defaults to "both". If "left", trims from the start date of the shortest time series; if "right", trims up to the end date of the shortest time series; if "both", trims between the start and end dates of the shortest time series.	`'both'`
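
The three directions can be sketched on a small panel held as a dictionary. This sketch assumes the common window runs from the latest start date to the earliest end date across entities, which is one way to read "the shortest time-series"; the helper name `trim_panel` is hypothetical.

```python
def trim_panel(panel, direction="both"):
    """Trim a panel given as {entity: [(time, value), ...]} with rows sorted by time."""
    if direction not in ("both", "left", "right"):
        raise ValueError(f"unsupported direction: {direction!r}")
    # Assumption: the common window is [latest start, earliest end].
    latest_start = max(series[0][0] for series in panel.values())
    earliest_end = min(series[-1][0] for series in panel.values())
    trimmed = {}
    for entity, series in panel.items():
        trimmed[entity] = [
            (t, v)
            for t, v in series
            if (direction == "right" or t >= latest_start)
            and (direction == "left" or t <= earliest_end)
        ]
    return trimmed
```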

`yeojohnson(brack=(-2, 2))`

Applies the Yeo-Johnson transformation to numeric columns in a panel DataFrame.

Parameters:

Name	Type	Description	Default
`brack`	`2-tuple`	The starting interval for a downhill bracket search with `scipy.optimize.brent`. In most cases this is not critical; the final result is allowed to be outside this bracket.	`(-2, 2)`
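
For reference, the transformation for a fixed lambda can be sketched in plain Python. This shows the standard Yeo-Johnson formula only; the helper name `yeo_johnson` is hypothetical, and the bracket search that selects lambda is not shown.

```python
import math


def yeo_johnson(values, lmbda):
    """Apply the Yeo-Johnson transform for a fixed lambda; accepts negative inputs."""
    out = []
    for x in values:
        if x >= 0:
            y = math.log1p(x) if lmbda == 0 else ((x + 1.0) ** lmbda - 1.0) / lmbda
        else:
            y = (
                -math.log1p(-x)
                if lmbda == 2
                else -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
            )
        out.append(y)
    return out
```

Unlike Box-Cox, the transform is defined for zero and negative values, and `lmbda=1` leaves the data unchanged.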