Create Patterns from Logs
Overview
The Log to Pattern node finds patterns in logs and then groups (clusters) them based on similarity.
DRAIN algorithm
The node uses the DRAIN algorithm to group similar log messages into patterns (a minimal sketch follows the list):
- A fixed-depth parse tree organizes log messages into a hierarchical structure based on token similarity.
- Logs are parsed top-down, following a branch that best matches their structure.
- Variable parts of logs (e.g., timestamps, IDs) are replaced with wildcards to generalize patterns.
- New log patterns are dynamically added as new logs appear.
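Here is a minimal, self-contained sketch of those mechanics. It is an illustration, not the node's actual implementation; the depth, max_child, and threshold arguments mirror the drain_tree_depth, drain_tree_max_child, and similarity_threshold parameters described below.

```python
# Minimal DRAIN-style sketch: illustrative only, not the node's implementation.
WILDCARD = "<*>"

class DrainNode:
    def __init__(self):
        self.children = {}   # routing token -> child DrainNode
        self.clusters = []   # at leaves: list of token templates

def seq_key(token):
    # Common DRAIN heuristic: tokens containing digits are treated as variables.
    return WILDCARD if any(c.isdigit() for c in token) else token

def add_log(root, tokens, depth=4, max_child=100, threshold=0.6):
    # Route top-down: level 1 keys on token count, inner levels on leading tokens.
    node = root
    for key in [str(len(tokens))] + [seq_key(t) for t in tokens[:depth - 1]]:
        if key not in node.children and len(node.children) >= max_child:
            key = WILDCARD   # node is full: fall back to the wildcard branch
        node = node.children.setdefault(key, DrainNode())
    # At the leaf, merge into the most similar template or start a new one.
    best, best_sim = None, 0.0
    for template in node.clusters:
        if len(template) == len(tokens):
            same = sum(a == b or a == WILDCARD for a, b in zip(template, tokens))
            sim = same / len(tokens)
            if sim > best_sim:
                best, best_sim = template, sim
    if best is not None and best_sim >= threshold:
        for i, (a, b) in enumerate(zip(best, tokens)):
            if a != b:
                best[i] = WILDCARD   # differing positions become wildcards
        return best
    node.clusters.append(list(tokens))
    return node.clusters[-1]

root = DrainNode()
for line in ["connect from 10.0.0.1", "connect from 10.0.0.2"]:
    pattern = add_log(root, line.split())
print(pattern)   # ['connect', 'from', '<*>']
```

Real implementations add preprocessing, caching, and masking rules, but the depth, branching, and similarity trade-offs described under Parameters all surface in this small example.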
The node generates a cluster_pattern_and_sample data item that contains one or more samples of logs used to generate the pattern. You can view aggregated pattern results in the Patterns explorer.
Parameters
The following parameters can be configured in the Log to Pattern node; each is described in detail below. For orientation, here is a hypothetical sketch of how they might fit together (the exact configuration syntax depends on your pipeline definition, and the values shown are illustrative, not recommended defaults):
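```python
# Hypothetical parameter set for a Log to Pattern node. The keys are the
# parameters documented on this page; every value is illustrative only.
log_to_pattern_params = {
    "drain_tree_depth": 4,
    "drain_tree_max_child": 100,
    "similarity_threshold": 0.6,             # balanced range per below: 0.5 - 0.7
    "num_of_clusters": 1000,
    "samples_per_cluster": 5,
    "reporting_frequency": 60,               # assumed: interval between reports
    "field_path": "message",                 # assumed: dotted path into the item
    "retire_period": 3600,                   # assumed: seconds of inactivity
    "throttle_limit_per_sec": 500,
    "group_by": ["service", "environment"],  # assumed: list of field expressions
}
```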
drain_tree_depth
The drain tree organizes logs into a structured hierarchy; its depth determines how many levels the tree can have before logs are grouped into a pattern. The drain_tree_depth parameter controls how deep the drain tree hierarchy extends when identifying log patterns. Increasing drain_tree_depth creates more granular patterns (higher specificity) and helps differentiate logs with subtle differences, but it consumes more memory because the tree structure is larger. Decreasing drain_tree_depth leads to broader patterns (higher generalization) and uses less memory, but it may group dissimilar logs together.
drain_tree_max_child
Each node in the tree represents a grouping of logs with similar structures; the number of child nodes per level defines how logs are distributed before they are merged into a pattern. The drain_tree_max_child parameter controls the maximum number of child nodes each node in the drain tree can have, and it plays a key role in pattern identification by determining how logs are grouped at each level. Increasing drain_tree_max_child allows more granular distinctions between log patterns and helps prevent different logs from being merged prematurely, but it consumes more memory because the tree structure is larger. Decreasing drain_tree_max_child leads to broader pattern generalization (logs merge sooner) and uses less memory, but different logs may be grouped incorrectly. A lower value is better for resource-limited environments; a higher value is useful for applications that need precise log classification.
similarity_threshold
When a new log entry arrives, its similarity to existing patterns is computed. The similarity_threshold parameter defines how similar a new log entry must be to an existing pattern before it is grouped into that pattern; it decides whether a log merges with an existing pattern or forms a new one. Increasing similarity_threshold requires logs to be very similar before they merge into the same pattern, leading to more distinct patterns (finer granularity), but it consumes more memory as more patterns are created. Decreasing similarity_threshold allows loosely similar logs to be grouped together, leading to fewer, more generalized patterns (coarser grouping); it uses less memory but may merge unrelated logs. At 0.0, any log can merge into any existing pattern (too loose to be practical); at 1.0, logs must be exactly identical to merge (also impractical). A typical default is around 0.5 - 0.7 for a balanced trade-off.
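To make the extremes concrete, here is a toy illustration. It assumes the token-position similarity from the sketch above, which is an assumption about how the node computes similarity:

```python
def should_merge(template, tokens, threshold):
    # Fraction of token positions that match (wildcards always match).
    if len(template) != len(tokens):
        return False
    same = sum(a == b or a == "<*>" for a, b in zip(template, tokens))
    return same / len(tokens) >= threshold

template = ["user", "<*>", "logged", "in"]
log      = ["user", "42",  "logged", "out"]   # 3 of 4 positions match (0.75)

print(should_merge(template, log, 0.5))   # True  -> merged, 'out' becomes '<*>'
print(should_merge(template, log, 0.9))   # False -> starts a new pattern
```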
num_of_clusters
Logs are grouped into clusters based on similarity. The num_of_clusters parameter defines the maximum number of clusters that can be maintained at run-time per input. Once the limit is reached, older or less relevant clusters may be removed. Increasing num_of_clusters allows more granular clustering and preserves finer details, but it uses more memory. Decreasing num_of_clusters forces more logs into fewer clusters (broader generalization) and saves memory, but it may merge unrelated logs.
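"Older or less relevant clusters may be removed" leaves the eviction policy open; a least-recently-used policy is one plausible reading, sketched here purely as an assumption:

```python
from collections import OrderedDict

num_of_clusters = 2  # tiny value to show eviction

clusters = OrderedDict()  # pattern -> cluster state, ordered by recency

def upsert(pattern):
    # Refresh recency on a hit; evict the least recently used cluster when full.
    if pattern in clusters:
        clusters.move_to_end(pattern)
        return
    if len(clusters) >= num_of_clusters:
        clusters.popitem(last=False)   # drop the oldest cluster
    clusters[pattern] = {"pattern": pattern}

for p in ["a <*>", "b <*>", "a <*>", "c <*>"]:
    upsert(p)
print(list(clusters))   # ['a <*>', 'c <*>'] ('b <*>' was evicted)
```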
samples_per_cluster
The samples_per_cluster parameter defines the maximum number of log messages stored per cluster before older messages are replaced by new ones. It keeps the memory footprint manageable while ensuring clusters remain representative of the latest data. Increasing samples_per_cluster stores more messages per cluster and preserves historical context for longer, but it uses more memory. For short-lived, dynamic logs (real-time monitoring), lower values help keep samples fresh; for long-term trend analysis, higher values retain past messages for better insights.
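One way to picture this bounded sample buffer (a sketch, not the node's internals) is a fixed-size deque where appending a new sample evicts the oldest:

```python
from collections import deque

samples_per_cluster = 3
samples = deque(maxlen=samples_per_cluster)

for msg in ["log 1", "log 2", "log 3", "log 4"]:
    samples.append(msg)   # once full, appending drops the oldest entry

print(list(samples))      # ['log 2', 'log 3', 'log 4']
```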
reporting_frequency
The reporting_frequency parameter defines how often the cluster pattern and cluster samples are sent to output nodes, controlling the cadence of updates in log clustering. Increasing reporting_frequency reduces the number of updates sent, saving bandwidth and processing resources, but it may delay the detection of new patterns. For real-time monitoring, use a lower value for quick pattern detection; for post-processing log analysis, use a higher value to optimize resource use.
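Treating reporting_frequency as an interval between reports (an assumption, though consistent with "increasing it reduces the number of updates"), the emission logic might look like:

```python
import time

reporting_frequency = 60.0  # assumed: seconds between reports; illustrative
_last_report = time.monotonic()

def maybe_report(flush):
    # Emit cluster patterns and samples only once the interval has elapsed.
    global _last_report
    now = time.monotonic()
    if now - _last_report >= reporting_frequency:
        flush()             # send cluster_pattern_and_sample items downstream
        _last_report = now
```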
field_path
By default, the data item body is evaluated for patterns. The field_path parameter lets you focus clustering on a specific field rather than the body, customizing pattern detection around the most relevant log data.
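A dotted path into the log record is the usual shape for such a parameter; whether this node accepts dotted paths is an assumption, but a hypothetical resolver would look like:

```python
def resolve_field(record, field_path):
    # Walk a dotted path such as "kubernetes.labels.app" into a nested record.
    value = record
    for part in field_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

record = {"body": "GET /health 200", "kubernetes": {"labels": {"app": "api"}}}
print(resolve_field(record, "kubernetes.labels.app"))   # api
```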
retire_period
When a log pattern is detected, it stays active as long as new logs matching it continue to appear. If no new logs match a pattern for a period equal to retire_period, the pattern is removed (retired) from memory. Increasing retire_period retains patterns for longer, even if they are infrequent; this helps track rare but important log patterns, but it consumes more memory as patterns accumulate. Decreasing retire_period removes inactive patterns quickly and saves memory, but patterns may be lost too soon, especially for sporadic logs. For short-lived, high-volume logs, use a shorter retire period to avoid excessive memory usage; for long-term log analysis (rare errors or security logs), use a longer retire period to retain important patterns.
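The retirement rule reduces to tracking a last-seen timestamp per pattern and sweeping out stale ones; a sketch under that assumption:

```python
import time

retire_period = 3600.0          # assumed: seconds of inactivity; illustrative
last_seen = {}                  # pattern id -> time of last matching log

def touch(pattern_id):
    # Called whenever a log matches the pattern: the pattern stays active.
    last_seen[pattern_id] = time.monotonic()

def retire_inactive():
    # Drop patterns that have not matched any log within retire_period.
    now = time.monotonic()
    for pattern_id in [p for p, t in last_seen.items() if now - t > retire_period]:
        del last_seen[pattern_id]
```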
throttle_limit_per_sec
The throttle_limit_per_sec parameter controls how many logs are clustered per second per source, regulating processing speed and preventing system overload. Increasing throttle_limit_per_sec allows faster clustering and real-time processing, but it can lead to higher CPU and memory usage. Decreasing throttle_limit_per_sec helps prevent overload in high-volume environments, but it may delay log processing and potentially miss real-time insights.
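A per-source, per-second admission counter is the simplest model of such a limit; this sketch assumes that shape (the node's actual policy for over-limit logs, drop versus defer, is not specified here):

```python
import time
from collections import defaultdict

throttle_limit_per_sec = 500   # illustrative

counts = defaultdict(int)      # (source, second window) -> logs admitted

def admit(source):
    # Allow at most throttle_limit_per_sec logs per source per wall-clock second.
    key = (source, int(time.monotonic()))
    if counts[key] >= throttle_limit_per_sec:
        return False           # over limit for this source in this second
    counts[key] += 1
    return True                # real code would also prune old windows
```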
group_by
The group_by parameter defines how incoming log entries are aggregated into clusters: the expressions provided determine which cluster each entry is associated with. For example, logs can be grouped by fields such as service name or environment, so that log items sharing those attributes are clustered together. This lets the system categorize and analyze logs by specific attributes, providing more contextual insight into the behavior or issues the logs reflect, such as service-specific errors or environment-specific trends.
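In effect, each distinct combination of group_by values gets its own clustering state; a sketch assuming the expressions are simple field names (hypothetical names, illustrative only):

```python
from collections import defaultdict

group_by = ["service", "environment"]   # assumed: simple field names

clusters_by_group = defaultdict(list)   # group key -> logs clustered together

def route(record):
    # Build the group key from the record, so each (service, environment)
    # pair is clustered independently of the others.
    key = tuple(record.get(field) for field in group_by)
    clusters_by_group[key].append(record["body"])

route({"service": "checkout", "environment": "prod", "body": "timeout calling db"})
route({"service": "checkout", "environment": "dev",  "body": "timeout calling db"})
print(len(clusters_by_group))   # 2: the same message clusters separately per group
```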