Edge Delta Deduplicate Node

Deduplicate logs in a pipeline.

Overview

The Dedup node is used to deduplicate identical logs discovered within a specified interval. Identical logs are those with the same body, attributes, and resources. This node is useful in scenarios where you want to reduce data volume.

Duplicate logs can result from parallel pipeline design or from systems upstream of the agent.

If duplicates are found, a single instance passes through with a log_count attribute recording the number of replicas dropped: two identical logs result in one log with a log_count of 1. The timestamp is ignored, so identical logs with different timestamps within the same interval are rolled up into a single log.
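The rollup behavior can be sketched as follows. This is a hypothetical illustration of the semantics described above, not Edge Delta's implementation: it buffers one interval's worth of logs and compares them by body, attributes, and resource, ignoring id and timestamp.

```python
def dedup(logs, count_field="log_count"):
    """Roll up identical logs from one interval's buffer (illustrative only)."""
    groups = {}   # identity key -> (kept log, number of replicas dropped)
    order = []    # preserve first-seen order of keys
    for log in logs:
        # Identity ignores id and timestamp: only body, attributes,
        # and resource are compared.
        key = (log["body"],
               tuple(sorted(log["attributes"].items())),
               tuple(sorted(log["resource"].items())))
        if key in groups:
            kept, dropped = groups[key]
            groups[key] = (kept, dropped + 1)  # drop the replica, count it
        else:
            groups[key] = (log, 0)
            order.append(key)
    out = []
    for key in order:
        kept, dropped = groups[key]
        if dropped:
            # One instance passes, enriched with the count of dropped replicas.
            kept = dict(kept)
            kept["attributes"] = dict(kept["attributes"], **{count_field: dropped})
        out.append(kept)
    return out
```

Two identical logs yield one output log whose count field is 1, matching the behavior described above.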

Example Configuration

In this example, identical logs within a 2-minute interval are rolled up into a single log. The http_status_code attribute is excluded from comparison, so variation in that field does not prevent rolling up. In addition, the log_count field is renamed to dedup_log_count.

- name: dedup_processor
  type: dedup
  interval: 2m
  excluded_field_paths:
    - item.attributes.http_status_code
  count_field_path: item.attributes.dedup_log_count

Example Input

{
  "id": "5a8fa619-54c6-4e99-98b4-e207827d7f7f",
  "timestamp": 1732507791786,
  "severity_text": "",
  "body": "User login attempt",
  "resource": {
    ...
  },
  "attributes": {
    "user_id": "56789",
    "http_status_code": "200"
  }
}
{
  "id": "c3ad2a78-af53-47e5-8407-c3956ff2c9a1",
  "timestamp": 1732507791798,
  "severity_text": "",
  "body": "User login attempt",
  "resource": {
    ...
  },
  "attributes": {
    "user_id": "56789",
    "http_status_code": "401"
  }
}

Example Output

{
  "id": "5a8fa619-54c6-4e99-98b4-e207827d7f7f",
  "timestamp": 1732507791786,
  "severity_text": "",
  "body": "User login attempt",
  "resource": {
    ...
  },
  "attributes": {
    "user_id": "56789",
    "http_status_code": "200",
    "dedup_log_count": 1
  }
}

A single log instance passes through, enriched with a dedup_log_count attribute indicating the number of duplicate instances that were dropped and rolled up into it.

Required Parameters

name

A descriptive name for the node. This is the name that appears in Visual Pipelines, and you can reference the node elsewhere in the YAML by this name. It must be unique across all nodes. It is a YAML list element, so it begins with a - and a space followed by the string. It is a required parameter for all nodes.

nodes:
  - name: <node name>
    type: <node type>

type: dedup

The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.

nodes:
  - name: <node name>
    type: <node type>

Optional Parameters

count_field_path

The count_field_path parameter specifies the path of the attribute field that will contain the integer count of duplicate logs that were dropped during rollup. You specify it as a string. It is optional and the default is log_count.

- name: dedup_processor
  type: dedup
  count_field_path: item.attributes.dedup_log_count

excluded_field_paths

The excluded_field_paths parameter specifies the fields that should not be evaluated for variation. Even if these fields differ, the logs can still be rolled up. You specify one or more items as a list, and it is optional. The timestamp is always excluded, while the body field cannot be excluded.

- name: dedup_processor
  type: dedup
  excluded_field_paths:
    - item.attributes.http_status_code
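The effect of an exclusion can be pictured with the identity comparison alone. The helper below is a hypothetical sketch that drops excluded attribute names before comparing; matching on bare attribute names is a simplification of the item.attributes.* path syntax.

```python
def identity_key(log, excluded_attrs=()):
    # Excluded attributes are left out of the comparison, so logs that
    # differ only in those fields still count as duplicates.
    attrs = {k: v for k, v in log["attributes"].items()
             if k not in excluded_attrs}
    return (log["body"], tuple(sorted(attrs.items())))

a = {"body": "User login attempt",
     "attributes": {"user_id": "56789", "http_status_code": "200"}}
b = {"body": "User login attempt",
     "attributes": {"user_id": "56789", "http_status_code": "401"}}

# Without the exclusion the keys differ; with it, the two logs roll up.
assert identity_key(a) != identity_key(b)
assert identity_key(a, ("http_status_code",)) == identity_key(b, ("http_status_code",))
```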

interval

The interval parameter defines the window within which the node evaluates logs for duplicates. Identical logs that fall in different intervals are not rolled up together; only duplicates within the same interval are deduplicated. It is specified as a duration. It is optional and the default is 30s.

- name: dedup_processor
  type: dedup
  interval: 2m