ML.DETECT_ANOMALIES in BigQuery

This article explains how the ML.DETECT_ANOMALIES function works in BigQuery ML, focusing on its use for time-series anomaly detection. It covers the core mechanics, input requirements, output fields, and practical considerations for using this function reliably in production.

Conceptual flow

Anomaly detection in BigQuery ML keeps training and detection separate. The overall flow is:

A forecasting model is first trained on historical data. The most common choice is ARIMA_PLUS, which captures trends, seasonality, and short-term patterns. This step defines what the model considers normal behaviour for the series.
New observations, following the same data structure as the training data, are then passed to the ML.DETECT_ANOMALIES function.
The function compares each new observation to what the model expects, measuring how far the observed value deviates from the prediction.
For each data point, the function returns a set of outputs: the expected value range (lower and upper bounds), an anomaly probability, and a flag indicating whether the point is considered anomalous.

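Keeping training and detection separate is important: the definition of normal stays stable between runs, and new data can be evaluated as it arrives without retraining on every run. The model should still be retrained periodically so that its forecasts stay current.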
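The training step can be sketched as follows. This is a minimal illustration, not a recommended configuration: the dataset, table, model, and column names (my_dataset, daily_sessions, daily_sessions_model, event_date, sessions) and the cutoff date are all hypothetical placeholders.

```sql
-- Train a single-series ARIMA_PLUS model on historical data.
-- All object and column names below are assumed for illustration.
CREATE OR REPLACE MODEL `my_dataset.daily_sessions_model`
OPTIONS (
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'event_date',  -- time ordering of observations
  TIME_SERIES_DATA_COL = 'sessions'          -- metric being modelled
) AS
SELECT
  event_date,
  sessions
FROM `my_dataset.daily_sessions`
WHERE event_date < DATE '2024-01-01';        -- assumed end of the training window
```

Once this statement has run, the model is a stored object, and detection queries can reference it without repeating the training step.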

Typical requirements

ML.DETECT_ANOMALIES does not take column-mapping arguments of its own. For time-series models, the columns are declared when the model is trained, through options on CREATE MODEL, and the input data passed at detection time must use the same column names:

A timestamp column, declared via TIME_SERIES_TIMESTAMP_COL, which defines the time ordering of observations.
A numeric metric column, declared via TIME_SERIES_DATA_COL, representing the value being measured at each point in time.
Optionally, one or more identifier columns, declared via TIME_SERIES_ID_COL, which allow multiple distinct series to be evaluated in a single query.

When identifier columns are used, the function runs in multi-series mode: each unique combination of identifier values is treated as its own independent series, with its own expected behaviour. This makes it possible to run anomaly detection across many segments at once, such as multiple properties, user groups, or markets, without creating and managing a separate model object for each.
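A multi-series setup might look like the sketch below. The table daily_sessions_by_market and the columns market and device_category are hypothetical, as are the model name and the dates. The point to notice is that the detection input selects the same column names that were declared at training time.

```sql
-- Train one model object covering every (market, device_category) series.
-- All object and column names are assumed for illustration.
CREATE OR REPLACE MODEL `my_dataset.sessions_by_market_model`
OPTIONS (
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'event_date',
  TIME_SERIES_DATA_COL = 'sessions',
  TIME_SERIES_ID_COL = ['market', 'device_category']
) AS
SELECT event_date, market, device_category, sessions
FROM `my_dataset.daily_sessions_by_market`
WHERE event_date < DATE '2024-01-01';

-- Evaluate newer observations, passed as a subquery with matching column names.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_dataset.sessions_by_market_model`,
  STRUCT(0.95 AS anomaly_prob_threshold),
  (
    SELECT event_date, market, device_category, sessions
    FROM `my_dataset.daily_sessions_by_market`
    WHERE event_date >= DATE '2024-01-01'
  )
);
```

The result contains one row per input observation, with the identifier columns carried through so that flagged points can be attributed to a specific series.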

What the function evaluates

ML.DETECT_ANOMALIES works by comparing observed values against the range of values the model expects at each point in time. After learning the patterns in a series, including trends, seasonal cycles, and typical variability, the model can estimate, for any given timestamp, what range of values would be considered normal.

The classification logic is straightforward:

If the observed value falls within the expected range, it is considered normal.
If the observed value falls outside that range, it is flagged as anomalous. The width of the range reflects both the model's estimated uncertainty and the chosen threshold.

How sensitive this classification is depends on the anomaly_prob_threshold parameter, which defaults to 0.95.

A higher threshold (e.g., 0.99) means the model only flags strong deviations, resulting in fewer alerts and fewer false positives, but it may also miss some real anomalies.
A lower threshold increases sensitivity, catching more deviations, but also producing more noise that requires filtering downstream.

The right threshold depends on the context and on how costly a missed anomaly is compared to a false alarm.
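One way to see the effect of the threshold is to count how many points are flagged at two different settings. The sketch below assumes a hypothetical ARIMA_PLUS model named daily_sessions_model. It omits the input table, which for ARIMA_PLUS models means the function evaluates the data the model was trained on.

```sql
-- Compare the number of flagged points at two thresholds.
-- The model name is a placeholder.
SELECT
  '0.95' AS threshold,
  COUNTIF(is_anomaly) AS flagged_points,
  COUNT(*) AS evaluated_points
FROM ML.DETECT_ANOMALIES(
  MODEL `my_dataset.daily_sessions_model`,
  STRUCT(0.95 AS anomaly_prob_threshold)
)
UNION ALL
SELECT
  '0.99',
  COUNTIF(is_anomaly),
  COUNT(*)
FROM ML.DETECT_ANOMALIES(
  MODEL `my_dataset.daily_sessions_model`,
  STRUCT(0.99 AS anomaly_prob_threshold)
);
```

The row for 0.99 should report no more flagged points than the row for 0.95, and the gap between the two gives a rough sense of how much alert volume the stricter setting removes.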

Interpreting common outputs

ML.DETECT_ANOMALIES returns several fields for each evaluated data point:

is_anomaly: A boolean flag indicating whether the data point has been classified as anomalous given the specified threshold. This is the primary signal used in alerting or triage workflows.
lower_bound and upper_bound: The boundaries of the model's expected range for that timestamp at the chosen threshold. Data points outside this range are the ones flagged as anomalous. These fields are also useful for visualisation, allowing expected ranges to be overlaid against observed values in reporting tools.
anomaly_probability (for time-series models; other model types return an equivalent scoring field): A numeric value between 0 and 1 indicating how likely it is that the point is anomalous according to the model. Higher values correspond to larger deviations and can be used to rank or prioritise flagged points for investigation.

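The exact field names vary with the model type. For non-time-series models the bounds and probability are replaced by a distance or error metric, for example normalized_distance for k-means models and mean_squared_error for autoencoder and PCA models. The interpretation remains the same across variants: a flag plus a measure of how far the point sits from normal.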
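As an illustration of how these fields are typically consumed, the query below returns only the flagged points from an ARIMA_PLUS model and ranks them by anomaly probability. The model name and the event_date and sessions columns are hypothetical and stand in for whatever names were used at training time.

```sql
-- List flagged points, strongest deviations first.
-- Model and column names are assumed for illustration.
SELECT
  event_date,
  sessions,             -- observed value
  lower_bound,          -- lower edge of the expected range
  upper_bound,          -- upper edge of the expected range
  anomaly_probability,
  is_anomaly
FROM ML.DETECT_ANOMALIES(
  MODEL `my_dataset.daily_sessions_model`,
  STRUCT(0.99 AS anomaly_prob_threshold)
)
WHERE is_anomaly
ORDER BY anomaly_probability DESC;
```

Dropping the WHERE clause returns every evaluated point, which is the form needed when plotting observed values against the expected range.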

Practical guidance

Getting reliable results from ML.DETECT_ANOMALIES in practice requires attention to a few key considerations:

Train on enough history. The quality of the model depends directly on the quality and length of the training data. The training period should be long enough to cover the full range of seasonal patterns in the metric. Too little history leads to inaccurate expected ranges and more classification errors.
Filter out sparse or irregular series before training. Series with too many gaps, very low volumes, or highly erratic patterns are difficult to model reliably. It is worth applying minimum data completeness filters before training to exclude series that are unlikely to produce stable results.
Validate against known events before going live. Before using anomaly detection for operational alerting, check whether the model correctly identifies known historical events, such as tracking failures, platform outages, or major campaigns. This validation helps confirm that the model is well-calibrated and informs the choice of threshold.
Treat the threshold as a parameter to tune, not a fixed setting. The anomaly_prob_threshold value should be adjusted based on observed results rather than set once and left. Testing different values against historical data and measuring how well they separate real anomalies from noise will lead to a more appropriate configuration.
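The completeness filter mentioned above could be sketched as follows. The table and column names are hypothetical, event_date is assumed to be a DATE column with one row per day per series, and the cutoffs (90% of days present, an average of at least 100 sessions per day) are arbitrary starting points rather than recommended values.

```sql
-- Select training rows only for series that are dense and large enough.
-- Names and cutoffs are assumed for illustration.
WITH series_stats AS (
  SELECT
    market,
    device_category,
    COUNT(DISTINCT event_date) AS days_present,
    -- Days between the first and last observation of the series, inclusive.
    DATE_DIFF(MAX(event_date), MIN(event_date), DAY) + 1 AS days_spanned,
    AVG(sessions) AS avg_sessions
  FROM `my_dataset.daily_sessions_by_market`
  WHERE event_date < DATE '2024-01-01'
  GROUP BY market, device_category
)
SELECT
  d.event_date,
  d.market,
  d.device_category,
  d.sessions
FROM `my_dataset.daily_sessions_by_market` AS d
JOIN series_stats AS s
  ON d.market = s.market
  AND d.device_category = s.device_category
WHERE d.event_date < DATE '2024-01-01'
  AND SAFE_DIVIDE(s.days_present, s.days_spanned) >= 0.9  -- completeness
  AND s.avg_sessions >= 100;                              -- minimum volume
```

A query of this shape can serve as the SELECT statement of a CREATE MODEL call. Because completeness is measured between each series' own first and last observation, a series that simply starts late is not penalised; a separate minimum-length condition would be needed to exclude very short series.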

Common pitfalls

Several issues come up frequently when deploying time-series anomaly detection in practice:

Too little training history. A model trained on a short window will not capture seasonal patterns, leading to systematically inaccurate expected ranges, especially during periods the training data did not cover.
Mixing different kinds of data in the same series. Combining observations from distinct segments, such as different acquisition channels or device types, into a single series identifier introduces inconsistencies that make the model harder to train accurately and reduce the quality of its predictions.
Treating every flagged point as an alert. The raw output of the function should not be used directly as an alert trigger without additional filtering. Requiring that an anomaly persist across several consecutive periods, or that it exceed a minimum absolute size, significantly reduces noise while still catching anomalies that matter.
Ignoring calendar effects. Holidays, product launches, promotions, and similar planned events cause deviations that look anomalous but are entirely expected. Where possible, these should be accounted for, either by encoding them as inputs to the model (ARIMA_PLUS, for example, accepts a HOLIDAY_REGION option at training time) or by filtering them out during post-processing, to avoid generating false alerts around predictable events.
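The persistence and minimum-size filters described above can be sketched with a window function. This example assumes a single-series ARIMA_PLUS model with hypothetical event_date and sessions columns. The window of three consecutive periods and the minimum excess of 500 sessions are arbitrary values chosen for illustration.

```sql
-- Keep a flagged point only if it ends a run of three flagged periods
-- and lies at least 500 units outside the expected range.
-- Model name, column names, and cutoffs are assumed for illustration.
WITH scored AS (
  SELECT
    event_date,
    sessions,
    lower_bound,
    upper_bound,
    is_anomaly,
    -- Number of flagged points among this row and the two rows before it.
    COUNTIF(is_anomaly) OVER (
      ORDER BY event_date
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS flagged_in_last_3
  FROM ML.DETECT_ANOMALIES(
    MODEL `my_dataset.daily_sessions_model`,
    STRUCT(0.99 AS anomaly_prob_threshold)
  )
)
SELECT *
FROM scored
WHERE flagged_in_last_3 = 3
  AND GREATEST(sessions - upper_bound, lower_bound - sessions) >= 500;
```

For a multi-series model, the window would also need a PARTITION BY clause on the identifier columns so that runs are counted within each series rather than across them.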

Minimal query pattern

The following query shows the basic structure for calling ML.DETECT_ANOMALIES against a trained BigQuery ML model:

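-- Replace the model and table references with your own.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `project.dataset.model_name`,
  -- Higher values flag only stronger deviations.
  STRUCT(0.99 AS anomaly_prob_threshold),
  -- New observations, using the same column names as the training data.
  TABLE `project.dataset.input_table`
);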

This is a structural template illustrating the calling convention. In practice, the model reference, threshold value, and input table should be adapted to match the specific data schema and operational requirements of the use case.
