Datasets

Dataset configs determine how data is extracted from the Zabbix server into datasets. Each dataset is listed in the config that was used to create it. Datasets are used as training data for the AI model and as validation sets to evaluate model performance in model tests.

To access and create dataset configs, navigate to AI > Dataset.

Creating dataset configs

To create a dataset config, click the blue plus button. This opens a form:

createDataset
AI: Create dataset config
  • Name: Name for the dataset config.
  • DataForge user group: Select a DataForge user group. As with the reporting feature, a user group is required so that the service user can be granted access to data on the Zabbix server.

Click Create to continue.

Configuring datasets

To configure a dataset, click the corresponding dataset config entry. This opens the dataset config details:

datasetConfigDetails
AI: Configure dataset

General

Change the dataset name or add a description.

configGeneral
Configure dataset: General

Dataset configuration

This configuration is split into two categories:

configDatasets
Configure dataset: Datasets
  • Hosts: Hosts from which the data is retrieved. Each host has an AI item count in parentheses next to its name.
  • Schedule: The schedule (interval) that determines the time span from which data for the dataset is extracted and processed.

Hosts

When one or more hosts are selected, DataForge extracts the item histories of those hosts from the Zabbix server.

DataForge extracts histories only from items tagged with df-ai. The tag must be set manually on each item that should be included in the dataset.

zabbixTag
Zabbix: Item tag df-ai
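The tag can be set in the Zabbix frontend or via the Zabbix API. Below is a minimal sketch using the item.get and item.update JSON-RPC methods; the URL, API token, and item ID are placeholders, and the call assumes a Zabbix version with item-level tags and API token authentication via the Authorization header (Zabbix 6.4+; older versions pass the token in the request body instead). Because item.update replaces the full tag list, the existing tags are fetched first.

```python
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder
API_TOKEN = "your-api-token"                               # placeholder
ITEM_ID = "12345"                                          # placeholder

def call(method, params):
    """Minimal JSON-RPC helper for the Zabbix API."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    response = requests.post(ZABBIX_URL, json=payload, headers=headers, timeout=10)
    response.raise_for_status()
    result = response.json()
    if "error" in result:
        raise RuntimeError(result["error"])
    return result["result"]

# item.update replaces the item's tags, so fetch the existing ones first.
item = call("item.get", {"itemids": [ITEM_ID], "selectTags": "extend"})[0]
tags = item.get("tags", [])

# Add the df-ai tag if it is not already present.
if not any(t["tag"] == "df-ai" for t in tags):
    tags.append({"tag": "df-ai", "value": ""})
    call("item.update", {"itemid": ITEM_ID, "tags": tags})
```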

Schedule

Two scheduling options are available:

  • Preset: Datasets are generated at a defined interval. Data recorded within that interval is included.
  • Manual: Start date, end date, and time are set manually. Only one dataset is created, using data from that period.

Storage options

Storage options control how long datasets are retained. A retention period in days and/or hours defines how long a dataset is kept; once a dataset is older than this period, it is deleted.

configStorage
Configure dataset: Storage options
The default setting `Days: 0, Hours: 0` keeps datasets indefinitely.

Datasets

This section lists all created datasets and provides the option to create a dataset on demand.

configDataset
Configure dataset: Datasets

Create a dataset on demand by clicking Extract data. This opens a modal where a name and description can be provided. Click Create to finish.

datasetOnDemand
Configure dataset: Create dataset on demand

Each dataset can be inspected further by clicking its entry. This opens the dataset details.

Dataset details

To inspect a dataset, navigate to AI > Dataset and select the config that contains the dataset. Created datasets are listed at the bottom of the page. Click the dataset to open its details.

Extraction progress

While data is still being extracted, progress can be followed in real time in the job log.

Job log

The job log shows each job carried out to create the dataset. Each step can be expanded for more details.

datasetJobLog
Dataset details: Job log
  • Processing time: Duration of the step.
  • Date and time: When the step was processed.
  • Type: Type of event.
  • Event: Event name.
  • Microservice: Microservice responsible for the event.
  • Full event: Full event stack trace.

Finished extracting

After extraction is complete, two additional sections are displayed.

General

This section contains basic information about the dataset:

datasetGeneral
Dataset details: General
  • Configured start time: Configured extraction start.
  • Configured end time: Configured extraction end.
  • Effective start time: First moment from which data is effectively usable. The dataset effectively starts once every item has at least one value; values recorded before that point cannot be used (see the sketch after this list).
  • Total datapoints available: Number of datapoints received from Zabbix.
  • Missing datapoints: Datapoints expected but missing in the time series.
  • Unusable datapoints: Datapoints collected before every item had at least one value (i.e., before the effective start time); these are discarded from the dataset.
  • Dataset size: Size of the dataset.
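As an illustration of how these values relate, the sketch below derives the effective start time and the unusable datapoint count from per-item histories. The data layout and field names are assumptions for illustration only, not DataForge's actual implementation.

```python
from datetime import datetime, timedelta

# Illustrative per-item histories: item name -> sorted list of value timestamps.
base = datetime(2024, 1, 1, 0, 0)
histories = {
    "CPU utilization": [base + timedelta(minutes=m) for m in range(0, 60)],
    "Memory usage":    [base + timedelta(minutes=m) for m in range(10, 60)],  # first value 10 min later
}

# Effective start time: the first moment at which every item has at least one value,
# i.e. the latest first timestamp across all items.
effective_start = max(series[0] for series in histories.values())

# Values recorded before the effective start time cannot be used and are discarded.
total = sum(len(series) for series in histories.values())
unusable = sum(1 for series in histories.values() for ts in series if ts < effective_start)

print(f"Effective start time:       {effective_start}")  # 2024-01-01 00:10:00
print(f"Total datapoints available: {total}")            # 110
print(f"Unusable datapoints:        {unusable}")          # 10
```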

Dataset items

Detailed information about each extracted item:

datasetItems
Dataset details: Items

Each card shows the item name and information about its time series.

Two time series states exist: Regular and Irregular. A regular time series has a defined update interval, so expected but missing datapoints can be identified and displayed as such. In an irregular time series, missing datapoints cannot be determined.

  • Total datapoints available: Number of datapoints received from Zabbix.
  • Missing datapoints: Number of expected datapoints that are missing. Only shown for regular time series.
  • Unusable datapoints: Datapoints collected before every item had at least one value; these are discarded (see Effective start time above).
  • Major/Minor outages: Number of major and minor outages. Outages occur when expected datapoints are missing in a regular time series. If the missing data exceeds 5% of the total, the outage is considered major. A Details button provides further information (see the sketch at the end of this section).
datasetOutages
Dataset details: Item outages details
  • Value density: Measures the concentration of datapoints in the time series; a higher density means more datapoints. The view also indicates the effective start time, i.e., when each item had at least one value.
datasetDensity
Dataset details: Item value density
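To illustrate how these per-item metrics could be derived for a regular time series, the sketch below counts missing datapoints against the update interval, groups consecutive gaps into outages, applies the 5% threshold described above as the major/minor cut-off, and computes a simple value density (available vs. expected datapoints). This is an assumption-laden approximation, not DataForge's actual calculation; in particular, the 5% threshold is interpreted here as applying per outage.

```python
from datetime import datetime, timedelta

interval = timedelta(minutes=1)           # defined update interval of the regular series
start = datetime(2024, 1, 1, 0, 0)
end = datetime(2024, 1, 1, 1, 0)
expected = int((end - start) / interval)  # datapoints expected in the period (60)

# Illustrative series with a gap from minute 20 to 26 (7 missing datapoints).
timestamps = [start + interval * m for m in range(expected) if not 20 <= m < 27]

available = len(timestamps)
missing = expected - available

# Group consecutive missing slots into outages.
present = set(timestamps)
outages, run = [], 0
for m in range(expected):
    if start + interval * m in present:
        if run:
            outages.append(run)
            run = 0
    else:
        run += 1
if run:
    outages.append(run)

# Interpretation: an outage is major if its missing datapoints exceed 5% of the expected total.
major = sum(1 for gap in outages if gap > 0.05 * expected)
minor = len(outages) - major

# Simple value density: share of expected datapoints that are actually present.
density = available / expected

print(f"Missing datapoints: {missing}, major outages: {major}, minor outages: {minor}")
print(f"Value density: {density:.2%}")
```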