Tuning Guide

Preset selection and tuning guidance for Prisma AI models.

This page covers the technical background behind Prisma AI models and the advanced settings you can adjust if the defaults don’t suit your environment. Note that our presets have been optimized across a broad range of diverse datasets, so it is unlikely that changes to them will produce generally more capable models; it is possible, however, to tune a model toward your specific use case. When adjusting the options, be careful not to overfit to that use case. For the user-facing configuration flow, see the Models page.

How Prisma AI models work

A Prisma AI model learns the normal behavior of the items in your dataset during training. Once deployed, the model continuously evaluates incoming observations against what it expects and produces an anomaly score. Higher values indicate a larger deviation from the learned baseline. Deployments then turn these scores into actionable alerts using configurable sensitivity thresholds.

Point Detection vs Segment Detection

Every Prisma AI model produces two streams of anomaly scores: a point detection score and a segment detection score. They are tuned for different jobs, and the system gets its accuracy from using them together rather than treating them as alternatives.

  • Point Detection operates at the level of a single measurement, producing one score per measurement that expresses how much it deviates from what the model expected. The resulting score stream traces the shape of an anomaly in detail, showing when a deviation begins, how it evolves, and when it subsides, which makes point detection the primary signal for analysis. The tradeoff is much lower precision: it prioritizes tracing the shape of potential anomalies over deciding whether they are real events, and on its own will respond to deviations that are not worth alerting on.
  • Segment Detection operates at the level of an event. It is optimized to produce a high-confidence verdict on whether an anomalous event is actually occurring, trading the fine-grained shape information of point detection for much higher precision and a much lower false-positive rate. This makes it the right signal for deciding whether something alert-worthy is happening.

The two scores are combined inside each Prisma AI model. Point detection provides the broad window and the descriptive shape; segment detection is used to validate that window. When a segment detection trigger lands inside a point-detection anomaly, that anomaly is treated as confirmed with segment detection’s much higher precision, while still carrying the rich shape information from point detection for investigation. The net effect is that you get the descriptive, analysis-friendly contour of point detection together with the low false-positive rate of segment detection, without having to choose between them.

In practice, this means the two thresholds play different roles when tuning sensitivity. The point detection threshold mainly shapes how much contour you see around anomalies during analysis, while the segment detection threshold is the one that governs whether an anomaly is actually treated as a real event.
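To make the combination concrete, here is a minimal sketch of the validation logic described above. Prisma performs this combination internally; the function below is illustrative only, and all names in it are hypothetical.

```python
# Hedged sketch of combining the two score streams: find point-detection
# windows, then keep only those confirmed by a segment-detection trigger.
# Illustrative only; Prisma's internal implementation may differ.

def confirmed_anomalies(point_scores, segment_scores, point_thr, segment_thr):
    # Collect contiguous windows where the point score exceeds its threshold.
    windows, start = [], None
    for i, s in enumerate(point_scores):
        if s > point_thr and start is None:
            start = i                      # a point anomaly window opens
        elif s <= point_thr and start is not None:
            windows.append((start, i))     # the window closes
            start = None
    if start is not None:
        windows.append((start, len(point_scores)))

    # A window is treated as a real event only if a segment trigger lands inside it.
    return [
        (a, b) for (a, b) in windows
        if any(segment_scores[i] > segment_thr for i in range(a, b))
    ]
```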

Choosing items for a model

Before picking a preset, decide which Zabbix items should be learned by the model. Prisma AI models learn cross-item patterns during training, so items that share an environment reinforce each other and produce more accurate detections together than in isolation. The reverse also holds: unrelated items mixed into one model contribute noise that weakens the learned baseline.

Group items by functional relationship. Strong groupings include:

  • Metrics from a single service, such as CPU, memory, request rate, and error rate for one application.
  • A cluster of hosts running the same workload, where individual hosts normally behave similarly.
  • Metrics that move together under normal load, for example database connections and request throughput on a web tier.

At the same time, avoid bundling items that carry essentially the same information. Redundant metrics do not add useful detection capacity and can overweight a single underlying signal during training, which skews what the model treats as normal. When in doubt, prefer one representative metric per underlying signal and let the model learn its relationship to the others.
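As an illustration, a strong grouping for a single web service might look like the following. The system-level keys are standard Zabbix item keys; the app-level keys are hypothetical examples, so substitute whatever your templates define.

```python
# Illustrative item groupings for two separate models. Keys marked
# "hypothetical" are examples, not standard Zabbix keys.

web_tier_model_items = [
    "system.cpu.util",              # CPU utilization
    "vm.memory.size[available]",    # available memory
    "net.if.in[eth0]",              # inbound network traffic
    "nginx.requests.rate",          # request rate (hypothetical app key)
    "nginx.errors.rate",            # error rate (hypothetical app key)
]

# A batch analytics system with unrelated seasonality belongs in its own model.
batch_model_items = [
    "system.cpu.util",
    "proc.num[etl_worker]",         # running ETL workers (hypothetical)
]
```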

Consider splitting into separate models when:

  • Two groups have unrelated schedules or seasonality, such as a batch analytics system and a customer-facing API.
  • One group is substantially noisier than another and is drowning out detections on the quieter group.

All items in one model share a single sensitivity setting and a single pair of alert triggers per deployment. To give different parts of your environment their own alerting behavior, split them into separate models.

Choosing a preset

Prisma ships four presets. They all detect the same kinds of anomalies, but differ in how many items they can handle, how much they vary between training runs, and how much compute they require.

Preset       Max items   Context window     Run-to-run variance   Reliability                          Relative training cost
Iris Nano    20          96 (~1.6 hours)    ~15%                  Single model, moderate               Low
Iris Core    40          192 (~3.2 hours)   ~10%                  Single model, improved reliability   Moderate
Iris Pro     80          192 (~3.2 hours)   ~3%                   Ensemble of 3, cluster of 2/3        High
Iris Ultra   128         192 (~3.2 hours)   <1%                   Ensemble of 5, cluster of 3/5        Very high

Run-to-run variance is the expected difference in detection behavior between two models trained on the same dataset. Lower variance means re-trained models behave more consistently over time, which is crucial when automatic unattended retraining is enabled.

Context window

Each preset uses a fixed context window, which is the amount of recent history the model considers when scoring a new point. Prisma samples data at 60-second intervals, so the step counts in the preset table above map directly to wall-clock history. A longer context window lets the model factor in more past behavior when judging the current measurement, at the cost of additional memory and compute per inference step.

The context window only controls how much recent history the model weighs at each inference step. Seasonal patterns like daily cycles, business hours, or weekly workload shifts are learned during training from the full extent of the dataset and become part of the model’s baseline, so they remain recognized even when the inference-time context window is short. For these patterns to be learned reliably, the training dataset should contain several complete periods: roughly three weeks for daily cycles, three months for weekly cycles, and a full year if yearly patterns are relevant. Shortening the context window does not remove the model’s knowledge of seasonality; it only reduces how much recent history the model weighs when scoring.
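The step-to-wall-clock mapping is simple arithmetic over the fixed 60-second sampling interval:

```python
# Step-to-wall-clock arithmetic for the preset context windows, assuming the
# fixed 60-second sampling interval described above.

SAMPLE_INTERVAL_S = 60

def window_hours(steps: int) -> float:
    return steps * SAMPLE_INTERVAL_S / 3600

print(window_hours(96))   # 1.6 -> Iris Nano
print(window_hours(192))  # 3.2 -> Iris Core / Pro / Ultra
```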

Ensembles and reliability

Iris Pro and Iris Ultra do not rely on a single trained model. Instead, they train several models on the same dataset, each with an independent random seed, and combine their anomaly scores at inference time. This has two practical effects:

  • Higher detection reliability. Cross-validation between ensemble members reduces the chance of a single-model quirk producing a false alarm.
  • Lower run-to-run variance. Because the final score is aggregated, differences between individual training runs average out.

The cost is proportionally more training time, memory, and, if GPU training is enabled, VRAM. For most users the Iris Pro ensemble is the sweet spot between reliability and training cost.

Why plain averaging is not enough

The simplest way to combine ensemble scores is to average them, but a plain mean is fragile. Models trained with different seeds develop slightly different quirks, and occasionally one member converges substantially worse than the rest, whether because of a bad random seed, an unlucky initialization, or a pathological interaction with the data. A plain mean has no defense against this: it is pulled by whichever member happens to be wrong on any given input, and a single badly-trained member drags every aggregate score along with it.

Consensus clustering

To avoid this, the larger presets aggregate ensemble scores with a consensus clustering step. At every inference step the tightest-agreeing cluster of K members contributes to the final score; the remaining outliers are discarded for that step. The cluster is re-picked each step, so a member that disagrees on one input can still contribute on the next. Two effects follow:

  • Outlier rejection. If one or two members disagree sharply with the rest, they are left out of the aggregate for that step. A member only pulls the final score when it lines up with at least K − 1 others, which neutralizes a badly-converged member automatically.
  • High-agreement signal. Scores that survive aggregation are those that K independently-seeded models all converged to. A high aggregate score therefore requires genuine multi-model consensus, which is the main reason detection is more precise on Iris Pro and Iris Ultra.
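For intuition, here is a minimal sketch of this aggregation step. Prisma’s internal implementation may differ in its details; the sketch only shows the idea of averaging the tightest-agreeing cluster of K scores and discarding the rest.

```python
# Hedged sketch of consensus-clustering aggregation: at each inference step,
# average the K member scores with the smallest spread and drop the outliers.
from itertools import combinations

def consensus_score(member_scores: list[float], k: int) -> float:
    """Average of the K scores that agree most tightly (smallest max - min)."""
    if len(member_scores) <= k:
        return sum(member_scores) / len(member_scores)
    tightest = min(
        combinations(sorted(member_scores), k),
        key=lambda c: c[-1] - c[0],   # spread of the candidate cluster
    )
    return sum(tightest) / k

# One badly-converged member (0.90) is rejected for this step; a plain mean
# would be dragged up to ~0.37.
print(consensus_score([0.10, 0.12, 0.90], k=2))  # ~0.11
```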

K per preset

K is the number of members required in the agreeing cluster. Larger K (relative to member count) makes the ensemble stricter, smaller K makes it more tolerant. The preset defaults are:

Preset       Members   K     Agreement requirement
Iris Nano    1         n/a   single model
Iris Core    1         n/a   single model
Iris Pro     3         2     2 of 3 must agree
Iris Ultra   5         3     3 of 5 must agree

Iris Pro’s 2-of-3 is permissive enough to react quickly while ensuring no single member can fire an alert alone. Iris Ultra’s 3-of-5 is a majority vote: no anomalous score reaches the final stream unless the majority of independently-trained models agree, which is what produces its very low run-to-run variance and false-positive rate.

When to pick each preset

  • Iris Nano: a good fit for simple systems. Also the right choice for quick experiments and proof-of-concept work. Fastest to train and lightest on hardware. Its higher run-to-run variance means automatic unattended retraining will produce more variable alerting behavior over time; for larger or reliability-critical production environments, prefer Iris Core or above.
  • Iris Core: a solid default for most small-to-medium production systems. Balanced training cost and reliability.
  • Iris Pro: well suited for larger environments of up to 80 items, and for smaller environments where consistent alerting is a priority. Runs three models in an ensemble with cross-validated detection, which substantially reduces the chance of any single model’s quirks producing false alarms. Its low run-to-run variance also makes it a strong match for automatic unattended retraining, since successive retrains will alert on the same patterns with minimal drift. Training cost is noticeably higher than the single-model presets, and we therefore recommend GPU training from Iris Pro upwards.
  • Iris Ultra: the right choice for the largest supported environments, up to 128 items, and for deployments where alerting must remain stable across many automatic retraining cycles. Runs five models in an ensemble with the strictest consensus voting of any preset, producing the lowest run-to-run variance and the highest detection reliability Prisma offers. The tradeoff is substantially more training time, memory, and VRAM, and we strongly recommend GPU training at this scale.

Advanced mode (expert only)

Before overriding any of these parameters, check whether your training data can be improved. Better training data almost always helps more than parameter tuning: a longer history, fewer gaps, and removing noisy or redundant items all change what the model can learn in the first place, while parameter tuning only changes how it learns from the data it already has.

The defaults for each preset are tuned for the majority of environments. Only override them if you have a clear reason, such as training consistently failing to converge, visible overfitting, or runs that take too long. Poorly chosen values can produce a model that detects nothing or flags everything.

Enable Advanced Mode on the model configuration to reveal these parameters. The groups below describe what each knob affects and which direction to move it; the exact numeric defaults come from the selected preset.

Training duration and convergence

  • Epochs: the hard upper bound on how long training runs. The model may finish earlier if early stopping engages. Increase only if the training and validation curves on the training report are still improving at the final epoch; decrease if training reliably plateaus well before the cap and runs are taking too long.
  • Early stopping patience: how many consecutive epochs without validation improvement to tolerate before stopping. Raise if the training curves are noisy and runs stop before the model has genuinely converged; lower to save time on datasets where the model converges quickly and later epochs only add noise.
  • Learning rate: the size of each weight update during training. Too high and the loss curve will oscillate, spike, or diverge outright; too low and training crawls and may never reach a good minimum. When adjusting, prefer halving or doubling from the preset default rather than changing by an order of magnitude.
  • Batch size: how many samples the model processes per gradient step. Larger batches give more stable, less noisy updates and train faster on the GPU, but use more VRAM. Smaller batches fit tighter hardware budgets at the cost of noisier training; when you raise the batch size substantially, a modest increase in learning rate usually compensates well.

Generalization and robustness

  • Dropout rate (0–1): how much of the network is randomly masked during each training step to force it to learn redundant, generalizable features. Raise when the model fits the training data well but performs poorly on held-out data (overfitting); lower when training and validation loss both plateau too high (underfitting). Values outside roughly 0.05–0.4 are rarely productive.
  • Weight decay: an additional regularizer that penalizes large weights and nudges the model toward simpler solutions. Use alongside dropout when tackling overfitting, and raise cautiously: too much weight decay prevents the model from fitting any pattern at all and produces a detector that flags nothing.
  • Noise strength: how much random perturbation is injected into the training data so that the model learns to tolerate measurement jitter. Raise for noisy data sources where small fluctuations should not be treated as anomalies; lower for very clean, stable signals where you want the model to stay sensitive to small deviations.

Model capacity

  • Depth: the number of layers in the model. Deeper models can represent more complex temporal patterns but are slower to train, harder to keep stable, and more prone to overfitting when the dataset is small or narrow.
  • Width: how many features each layer works with. Wider models have more capacity per layer and are usually a safer way to add capacity than depth, but they increase VRAM usage and training time roughly linearly.

Transformer architecture

These parameters control the internal attention layers used by Prisma’s AI models. They have a strong effect on training cost and model behavior, and are the most likely to destabilize training if changed carelessly.

  • Transformer head count: the number of attention heads used in each self-attention layer. More heads let the model focus on several distinct patterns in parallel, which can help on datasets where items interact in complex ways, at the cost of proportionally more compute and VRAM. On simple, single-signal items, raising this above the preset default is rarely productive.
  • Transformer key dimension: the size of each attention head’s key/query vector. Larger values give each head a richer view of the input sequence, which can improve detection of subtle long-range patterns, but memory use and compute scale with this knob multiplied by the head count, so push it up cautiously.
  • Transformer feed-forward multiplier: how much wider the per-position feed-forward layer is relative to the model’s working dimension. Raising it adds capacity to the non-attention pathway and can help with complex, non-periodic behavior, but it inflates parameter count and training time faster than any other capacity knob, so it is usually the wrong place to start when a model needs more capacity.

Forecasting horizon

  • Horizon: how many time steps ahead the model is asked to predict during training. A longer horizon forces the model to build a richer internal representation of the data, which can improve detection of slow, gradual deviations, but it also makes the training task harder and the loss curve noisier. A shorter horizon is easier to fit and trains faster, but leaves the model less able to anticipate slower phenomena. The preset defaults are sized to match their context windows and should only be changed when you have a specific reason.
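As a consolidated illustration of the groups above, a hypothetical override configuration might look like the following. The parameter names and values here are illustrative, not Prisma’s actual defaults; real values come from the selected preset, and the comments summarize which direction to move each knob.

```python
# Hypothetical advanced-mode overrides; names and values are illustrative only.
overrides = {
    # Training duration and convergence
    "epochs": 200,                   # raise only if curves still improve at the cap
    "early_stopping_patience": 10,   # raise for noisy curves, lower for fast convergence
    "learning_rate": 1e-4,           # halve or double from the preset default, not 10x
    "batch_size": 64,                # larger = stabler updates, more VRAM

    # Generalization and robustness
    "dropout_rate": 0.1,             # keep roughly within 0.05-0.4
    "weight_decay": 1e-5,            # raise cautiously, alongside dropout
    "noise_strength": 0.05,          # raise for jittery sources, lower for clean signals

    # Model capacity
    "depth": 4,                      # deeper = more complex patterns, less stable
    "width": 128,                    # usually the safer way to add capacity

    # Transformer architecture
    "transformer_heads": 4,          # memory scales with heads * key dimension
    "transformer_key_dim": 32,
    "transformer_ff_multiplier": 4,  # inflates parameter count fastest of all knobs

    # Forecasting horizon
    "horizon": 12,                   # sized to the context window by default
}
```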

Validating tuning changes

Most tuning decisions only show their effect once a training run completes or a model sees unfamiliar data, so two feedback signals matter more than any single parameter choice: the training loss graph while the model is training, and model tests after it has trained.

Reading the training loss graph

The model details page plots training loss and validation loss per epoch (see Models). The shape of the two curves is usually a better signal for tuning than any individual metric value. If runs repeatedly behave badly in the same way, re-inspect the dataset before tuning further. A bad convergence pattern caused by a low-quality dataset will not be fixed by any parameter change.

  • Healthy convergence. Both curves fall and then flatten, with validation loss settling close to training loss.
  • Training loss keeps falling while validation loss stops improving or climbs. The model is overfitting. Raise dropout, raise weight decay, or raise noise strength before adding capacity.
  • Both curves plateau well above zero and stop improving early. The model is underfitting. Raise epochs or early stopping patience first; if neither helps, raise width or depth modestly.
  • Loss oscillates, spikes, or diverges outright. The learning rate is too high. Halve it. A single spike followed by recovery is usually benign; repeated spikes or a spike the run never recovers from is not.
  • Training stops long before the loss stops improving. Early stopping patience or epoch count are too low.
  • Loss is very noisy but trends downward. Batch size is likely too small. Raise it first, and raise noise strength only if the underlying signal is genuinely noisy.
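The same heuristics can be sketched as code. The function below is a rough illustration with arbitrary thresholds, not a substitute for reading the actual graphs.

```python
# Rough sketch of the curve-reading heuristics above, applied to per-epoch
# loss arrays. Thresholds are arbitrary illustrations.

def diagnose(train_loss: list[float], val_loss: list[float]) -> str:
    tail = max(1, len(val_loss) // 5)                  # last ~20% of epochs
    train_trend = train_loss[-tail] - train_loss[-1]   # > 0: train loss still falling
    val_trend = val_loss[-tail] - val_loss[-1]         # > 0: val loss still falling
    if train_trend > 0 and val_trend <= 0:
        return "overfitting: raise dropout, weight decay, or noise strength"
    if max(val_loss[-tail:]) > 2 * min(val_loss):
        return "instability: learning rate likely too high, halve it"
    if val_trend > 0:
        return "still improving: raise epochs or early stopping patience"
    return "converged: validate with a model test"
```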

Validating with model tests

A training run that converged cleanly is not yet a model that detects well. The authoritative way to judge a tuning change is to run the trained model against data it did not train on and compare what it flags against what you expected to see.

  1. Hold out a validation dataset from the same environment as the training data, ideally one that contains incidents with a known shape and timing.
  2. Train two models, one with the old configuration and one with the new, against the same training dataset.
  3. Run a model test for each trained model against the same held-out dataset and compare the anomaly score graphs side by side.

What to look for when comparing:

  • Does the new configuration flag the known incidents at least as clearly as the old one?
  • Are the quiet periods actually quieter, or did the change trade off known-good detection for lower noise?
  • Does the anomaly shape on point detection still resemble the event, or has it become smeared or spiky?

Avoid iterating repeatedly against the same test dataset until the numbers look good. Doing so fits the hyperparameters to that specific dataset and the improvements rarely generalize.

Output items and triggers

For each deployment, Prisma creates four Zabbix items on the output host, all prefixed with the item key prefix you chose when deploying the model. They are written as numeric floats once per inference step.

Item key suffix     What it contains
point.score         The current point detection anomaly score. Higher values mean the most recent measurement deviates more from the model’s expectation.
point.threshold     The active threshold for the point detection score. When point.score exceeds this value, a point anomaly is considered active.
segment.score       The current segment detection anomaly score. Higher values mean the recent stretch of behavior is more likely to be a real anomalous event.
segment.threshold   The active threshold for the segment detection score. When segment.score exceeds this value, a segment anomaly is considered active.

The thresholds are exposed as items (rather than fixed constants) so that the current sensitivity setting of the deployment is always visible alongside the scores, and so any changes you make to sensitivity are reflected directly in the graphs.

Alongside these items, Prisma also creates two Zabbix triggers per deployment, one for point detection and one for segment detection. Each trigger fires a Problem on your Zabbix server whenever the corresponding score exceeds its threshold. Because segment detection is the precision-optimized signal, its trigger is the one you should generally treat as the authoritative alert; the point detection trigger is most useful for visibility into the broader anomaly shape around that alert.
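As a sketch of how these items fit together, the snippet below reads the four output items for one deployment and applies the same logic as the two triggers. The fetch_latest helper is hypothetical; in practice these are ordinary Zabbix items you would read via the Zabbix API (for example history.get) or view in the frontend.

```python
# Illustrative reading of the four output items for one deployment.

PREFIX = "prisma.mymodel"  # example item key prefix chosen at deployment time

def fetch_latest(item_key: str) -> float:
    """Hypothetical helper: return an item's latest value, e.g. via the
    Zabbix API's history.get. Stubbed so the sketch is self-contained."""
    return 0.0  # replace with a real lookup

point_score = fetch_latest(f"{PREFIX}.point.score")
point_thr = fetch_latest(f"{PREFIX}.point.threshold")
segment_score = fetch_latest(f"{PREFIX}.segment.score")
segment_thr = fetch_latest(f"{PREFIX}.segment.threshold")

# Mirrors the two triggers Prisma creates: segment detection is the
# authoritative alert, point detection adds shape context around it.
if segment_score > segment_thr:
    print("confirmed anomaly (segment trigger)")
elif point_score > point_thr:
    print("point anomaly active, awaiting segment confirmation")
```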

Tuning sensitivity

Each deployment has a sensitivity setting that shifts how aggressively its thresholds flag an anomaly. Sensitivity is a deployment-side knob, not a model-config one, so changing it takes effect immediately without retraining, and you can run two deployments of the same model side by side with different sensitivities to compare behavior. The UI options themselves are documented on the Deployments page; the notes below are about how to decide between them.

A few rules of thumb for choosing a level:

  • Start with Recommended. The default is tuned to produce a reasonable alerting rate on a broad range of datasets and is the right starting point unless you already know you need to shift.
  • Move toward Conservative if your team is experiencing alert fatigue, or if the monitored system is genuinely noisy.
  • Move toward Aggressive if missing an anomaly is more costly than investigating the occasional false positive.

When in doubt, run a model test against an incident whose shape you already know and compare how each sensitivity level reacts before committing to a change in production.

See also

  • AI Hardware Requirements: CPU, RAM, and GPU requirements for training.
  • Models: create and manage model configurations from the UI.
  • Datasets: prepare and manage the datasets used for training.