Create Patterns from Logs
Overview
The Log to Pattern node finds patterns in logs and then groups (clusters) them based on similarity.
DRAIN algorithm
The node uses the DRAIN algorithm to group similar log messages into patterns (a minimal sketch follows the list):
- A fixed-depth parse tree organizes log messages into a hierarchical structure based on token similarity.
- Logs are parsed top-down, following a branch that best matches their structure.
- Variable parts of logs (e.g., timestamps, IDs) are replaced with wildcards to generalize patterns.
- New log patterns are dynamically added as new logs appear.
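Here is a minimal, self-contained sketch of those mechanics. It is an illustration, not the node's actual implementation; the depth, max_child, and threshold arguments mirror the drain_tree_depth, drain_tree_max_child, and similarity_threshold parameters described below.

```python
# Minimal DRAIN-style sketch: illustrative only, not the node's implementation.
WILDCARD = "<*>"

class DrainNode:
    def __init__(self):
        self.children = {}   # routing token -> child DrainNode
        self.clusters = []   # at leaves: list of token templates

def seq_key(token):
    # Common DRAIN heuristic: tokens containing digits are treated as variables.
    return WILDCARD if any(c.isdigit() for c in token) else token

def add_log(root, tokens, depth=4, max_child=100, threshold=0.6):
    # Route top-down: level 1 keys on token count, inner levels on leading tokens.
    node = root
    for key in [str(len(tokens))] + [seq_key(t) for t in tokens[:depth - 1]]:
        if key not in node.children and len(node.children) >= max_child:
            key = WILDCARD   # node is full: fall back to the wildcard branch
        node = node.children.setdefault(key, DrainNode())
    # At the leaf, merge into the most similar template or start a new one.
    best, best_sim = None, 0.0
    for template in node.clusters:
        if len(template) == len(tokens):
            same = sum(a == b or a == WILDCARD for a, b in zip(template, tokens))
            sim = same / len(tokens)
            if sim > best_sim:
                best, best_sim = template, sim
    if best is not None and best_sim >= threshold:
        for i, (a, b) in enumerate(zip(best, tokens)):
            if a != b:
                best[i] = WILDCARD   # differing positions become wildcards
        return best
    node.clusters.append(list(tokens))
    return node.clusters[-1]

root = DrainNode()
for line in ["connect from 10.0.0.1", "connect from 10.0.0.2"]:
    pattern = add_log(root, line.split())
print(pattern)   # ['connect', 'from', '<*>']
```

Real implementations add preprocessing, caching, and masking rules, but the depth, branching, and similarity trade-offs described under Parameters all surface in this small example.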
The node generates a cluster_pattern_and_sample data item that contains one or more samples of logs used to generate the pattern. You can view aggregated pattern results in the Patterns explorer.
Parameters
The following parameters can be configured in the Log to Pattern node; each is described in detail below. For orientation, here is a hypothetical sketch of how they might fit together (the exact configuration syntax depends on your pipeline definition, and the values shown are illustrative, not recommended defaults):
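```python
# Hypothetical parameter set for a Log to Pattern node. The keys are the
# parameters documented on this page; every value is illustrative only.
log_to_pattern_params = {
    "drain_tree_depth": 4,
    "drain_tree_max_child": 100,
    "similarity_threshold": 0.6,             # balanced range per below: 0.5 - 0.7
    "num_of_clusters": 1000,
    "samples_per_cluster": 5,
    "reporting_frequency": 60,               # assumed: interval between reports
    "field_path": "message",                 # assumed: dotted path into the item
    "retire_period": 3600,                   # assumed: seconds of inactivity
    "throttle_limit_per_sec": 500,
    "group_by": ["service", "environment"],  # assumed: list of field expressions
}
```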
drain_tree_depth
The drain tree organizes logs into a structured hierarchy; its depth determines how many levels the tree can have before logs are grouped into a pattern. The drain_tree_depth parameter controls how deep the drain tree hierarchy extends when identifying log patterns. Increasing drain_tree_depth creates more granular patterns (higher specificity) and helps differentiate logs with subtle differences, but it consumes more memory because the tree structure is larger. Decreasing drain_tree_depth leads to broader patterns (higher generalization) and uses less memory, but it may group dissimilar logs together.
drain_tree_max_child
Each node in the tree represents a grouping of logs with similar structures; the number of child nodes per level defines how logs are distributed before they are merged into a pattern. The drain_tree_max_child parameter controls the maximum number of child nodes each node in the drain tree can have, and it plays a key role in pattern identification by determining how logs are grouped at each level. Increasing drain_tree_max_child allows more granular distinctions between log patterns and helps prevent different logs from being merged prematurely, but it consumes more memory because the tree structure is larger. Decreasing drain_tree_max_child leads to broader pattern generalization (logs merge sooner) and uses less memory, but different logs may be grouped incorrectly. A lower value is better for resource-limited environments; a higher value is useful for applications that need precise log classification.
similarity_threshold
When a new log entry arrives, its similarity to existing patterns is computed. The similarity_threshold parameter defines how similar a new log entry must be to an existing pattern before it is grouped into that pattern; it decides whether a log merges with an existing pattern or forms a new one. Increasing similarity_threshold requires logs to be very similar before they merge into the same pattern, leading to more distinct patterns (finer granularity), but it consumes more memory as more patterns are created. Decreasing similarity_threshold allows loosely similar logs to be grouped together, leading to fewer, more generalized patterns (coarser grouping); it uses less memory but may merge unrelated logs. At 0.0, any log can merge into any existing pattern (too loose to be practical); at 1.0, logs must be exactly identical to merge (also impractical). A typical default is around 0.5 - 0.7 for a balanced trade-off.
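To make the extremes concrete, here is a toy illustration. It assumes the token-position similarity from the sketch above, which is an assumption about how the node computes similarity:

```python
def should_merge(template, tokens, threshold):
    # Fraction of token positions that match (wildcards always match).
    if len(template) != len(tokens):
        return False
    same = sum(a == b or a == "<*>" for a, b in zip(template, tokens))
    return same / len(tokens) >= threshold

template = ["user", "<*>", "logged", "in"]
log      = ["user", "42",  "logged", "out"]   # 3 of 4 positions match (0.75)

print(should_merge(template, log, 0.5))   # True  -> merged, 'out' becomes '<*>'
print(should_merge(template, log, 0.9))   # False -> starts a new pattern
```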
num_of_clusters
Logs are grouped into clusters based on similarity. The num_of_clusters parameter defines the maximum number of clusters that can be maintained at run-time per input. Once the limit is reached, older or less relevant clusters may be removed. Increasing num_of_clusters allows more granular clustering and preserves finer details, but it uses more memory. Decreasing num_of_clusters forces more logs into fewer clusters (broader generalization) and saves memory, but it may merge unrelated logs.
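"Older or less relevant clusters may be removed" leaves the eviction policy open; a least-recently-used policy is one plausible reading, sketched here purely as an assumption:

```python
from collections import OrderedDict

num_of_clusters = 2  # tiny value to show eviction

clusters = OrderedDict()  # pattern -> cluster state, ordered by recency

def upsert(pattern):
    # Refresh recency on a hit; evict the least recently used cluster when full.
    if pattern in clusters:
        clusters.move_to_end(pattern)
        return
    if len(clusters) >= num_of_clusters:
        clusters.popitem(last=False)   # drop the oldest cluster
    clusters[pattern] = {"pattern": pattern}

for p in ["a <*>", "b <*>", "a <*>", "c <*>"]:
    upsert(p)
print(list(clusters))   # ['a <*>', 'c <*>'] ('b <*>' was evicted)
```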
samples_per_cluster
The samples_per_cluster parameter defines the maximum number of log messages stored per cluster before older messages are replaced by new ones. It keeps the memory footprint manageable while ensuring clusters remain representative of the latest data. Increasing samples_per_cluster stores more messages per cluster and preserves historical context for longer, but it uses more memory. For short-lived, dynamic logs (real-time monitoring), lower values help keep samples fresh; for long-term trend analysis, higher values retain past messages for better insights.
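One way to picture this bounded sample buffer (a sketch, not the node's internals) is a fixed-size deque where appending a new sample evicts the oldest:

```python
from collections import deque

samples_per_cluster = 3
samples = deque(maxlen=samples_per_cluster)

for msg in ["log 1", "log 2", "log 3", "log 4"]:
    samples.append(msg)   # once full, appending drops the oldest entry

print(list(samples))      # ['log 2', 'log 3', 'log 4']
```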
reporting_frequency
The reporting_frequency parameter defines how often the cluster pattern and cluster samples are sent to output nodes, controlling the cadence of updates in log clustering. Increasing reporting_frequency reduces the number of updates sent, saving bandwidth and processing resources, but it may delay the detection of new patterns. For real-time monitoring, use a lower value for quick pattern detection; for post-processing log analysis, use a higher value to optimize resource use.
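Treating reporting_frequency as an interval between reports (an assumption, though consistent with "increasing it reduces the number of updates"), the emission logic might look like:

```python
import time

reporting_frequency = 60.0  # assumed: seconds between reports; illustrative
_last_report = time.monotonic()

def maybe_report(flush):
    # Emit cluster patterns and samples only once the interval has elapsed.
    global _last_report
    now = time.monotonic()
    if now - _last_report >= reporting_frequency:
        flush()             # send cluster_pattern_and_sample items downstream
        _last_report = now
```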
field_path
By default, the data item body is evaluated for patterns. The field_path parameter lets you focus clustering on a specific field rather than the body, customizing pattern detection around the most relevant log data.
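A dotted path into the log record is the usual shape for such a parameter; whether this node accepts dotted paths is an assumption, but a hypothetical resolver would look like:

```python
def resolve_field(record, field_path):
    # Walk a dotted path such as "kubernetes.labels.app" into a nested record.
    value = record
    for part in field_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

record = {"body": "GET /health 200", "kubernetes": {"labels": {"app": "api"}}}
print(resolve_field(record, "kubernetes.labels.app"))   # api
```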
retire_period
When a log pattern is detected, it stays active as long as new logs matching it continue to appear. If no new logs match a pattern for a period equal to retire_period, the pattern is removed (retired) from memory. Increasing retire_period retains patterns for longer, even if they are infrequent; this helps track rare but important log patterns, but it consumes more memory as patterns accumulate. Decreasing retire_period removes inactive patterns quickly and saves memory, but patterns may be lost too soon, especially for sporadic logs. For short-lived, high-volume logs, use a shorter retire period to avoid excessive memory usage; for long-term log analysis (rare errors or security logs), use a longer retire period to retain important patterns.
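The retirement rule reduces to tracking a last-seen timestamp per pattern and sweeping out stale ones; a sketch under that assumption:

```python
import time

retire_period = 3600.0          # assumed: seconds of inactivity; illustrative
last_seen = {}                  # pattern id -> time of last matching log

def touch(pattern_id):
    # Called whenever a log matches the pattern: the pattern stays active.
    last_seen[pattern_id] = time.monotonic()

def retire_inactive():
    # Drop patterns that have not matched any log within retire_period.
    now = time.monotonic()
    for pattern_id in [p for p, t in last_seen.items() if now - t > retire_period]:
        del last_seen[pattern_id]
```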
throttle_limit_per_sec
The throttle_limit_per_sec parameter controls how many logs are clustered per second per source, regulating processing speed and preventing system overload. Increasing throttle_limit_per_sec allows faster clustering and real-time processing, but it can lead to higher CPU and memory usage. Decreasing throttle_limit_per_sec helps prevent overload in high-volume environments, but it may delay log processing and potentially miss real-time insights.
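A per-source, per-second admission counter is the simplest model of such a limit; this sketch assumes that shape (the node's actual policy for over-limit logs, drop versus defer, is not specified here):

```python
import time
from collections import defaultdict

throttle_limit_per_sec = 500   # illustrative

counts = defaultdict(int)      # (source, second window) -> logs admitted

def admit(source):
    # Allow at most throttle_limit_per_sec logs per source per wall-clock second.
    key = (source, int(time.monotonic()))
    if counts[key] >= throttle_limit_per_sec:
        return False           # over limit for this source in this second
    counts[key] += 1
    return True                # real code would also prune old windows
```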
group_by
The group_by parameter defines how incoming log entries are aggregated into clusters: the expressions provided determine which cluster each entry is associated with. For example, logs can be grouped by fields such as service name or environment, so that log items sharing those attributes are clustered together. This lets the system categorize and analyze logs by specific attributes, providing more contextual insight into the behavior or issues the logs reflect, such as service-specific errors or environment-specific trends.
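In effect, each distinct combination of group_by values gets its own clustering state; a sketch assuming the expressions are simple field names (hypothetical names, illustrative only):

```python
from collections import defaultdict

group_by = ["service", "environment"]   # assumed: simple field names

clusters_by_group = defaultdict(list)   # group key -> logs clustered together

def route(record):
    # Build the group key from the record, so each (service, environment)
    # pair is clustered independently of the others.
    key = tuple(record.get(field) for field in group_by)
    clusters_by_group[key].append(record["body"])

route({"service": "checkout", "environment": "prod", "body": "timeout calling db"})
route({"service": "checkout", "environment": "dev",  "body": "timeout calling db"})
print(len(clusters_by_group))   # 2: the same message clusters separately per group
```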