Edge Delta Google Cloud Storage Output

Configure the Google Cloud Storage output node to archive data using GCS with options for compression and encoding.

17 minute read

Overview

The GCS Destination archives items in a Google Cloud Storage destination. These items are raw archive bytes that are buffered with the archive buffer processor. Use the configuration wizard below to generate a starter YAML configuration.

incoming_data_types: log

This node requires Edge Delta agent version v0.1.59 or higher.

Note: Agent versions v2.8.0, v2.9.0, and v2.10.0 contain a regression that causes excessive GCS write requests, potentially increasing costs and API usage. Upgrade to v2.11.0 or later to resolve this issue.

Configuring GCS

Before you can create an output, you must have a Google Cloud Storage HMAC access key for a service account that contains the Storage Admin HMAC role. See how to Prepare GCS.

Example Configuration

nodes:
  - name: my_gcs
    type: gcs_output
    bucket: <REDACTED>
    hmac_access_key: <REDACTED>
    hmac_secret: <REDACTED>
    compression: zstd
    encoding: parquet
    use_native_compression: true
    path_prefix:
      order:
      - Year
      - Month
      - Day
      - Hour
      - 2 Minute
      - tag
      - host
      format: ver=parquet/year=%s/month=%s/day=%s/hour=%s/min=%s/tag=%s/host=%s/

Required Parameters

name

A descriptive name for the node. This is the name that will appear in pipeline builder and you can reference this node in the YAML using the name. It must be unique across all nodes. It is a YAML list element so it begins with a - and a space followed by the string. It is a required parameter for all nodes.

nodes:
  - name: <node name>
    type: <node type>

type: gcs_output

The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.

nodes:
  - name: <node name>
    type: <node type>

bucket

The bucket parameter defines the target bucket to use. It is specified as a string and is required.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>

hmac_access_key

The hmac_access_key parameter is the GCS HMAC access key that has permissions to upload files to the bucket. It is used with hmac_secret. It is specified as a string and is required.

This field supports secret references for secure credential management. Instead of hardcoding sensitive values, you can reference a secret configured in your pipeline.

To use a secret in the GUI:

Create a secret in your pipeline’s Settings > Secrets section (see Secrets)
In this field, select the secret name from the dropdown list that appears

To use a secret in YAML: Reference it using the syntax: '{{ SECRET secret-name }}'

Example:

field_name: '{{ SECRET my-credential }}'

Note: The secret reference must be enclosed in single quotes when using YAML. Secret values are encrypted at rest and resolved at runtime, ensuring no plaintext credentials appear in logs or API responses.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    hmac_access_key: <access key>
    hmac_secret: <key secret>

hmac_secret

The hmac_secret parameter is the GCS HMAC secret associated with the access key. It is used with hmac_access_key. It is specified as a string and is required.

This field supports secret references for secure credential management. Instead of hardcoding sensitive values, you can reference a secret configured in your pipeline.

To use a secret in the GUI:

Create a secret in your pipeline’s Settings > Secrets section (see Secrets)
In this field, select the secret name from the dropdown list that appears

To use a secret in YAML: Reference it using the syntax: '{{ SECRET secret-name }}'

Example:

field_name: '{{ SECRET my-credential }}'

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    hmac_access_key: <access key>
    hmac_secret: <key secret>

Optional Parameters

disabled

The disabled parameter disables the node in the pipeline. When set to true, the agent ignores the node at startup. You can toggle nodes on or off in the pipeline builder. It is specified as a boolean, defaults to false, and is optional.

nodes:
  - name: <node name>
    type: <node type>
    disabled: true

bucket_expression

Minimum Agent Version: v2.8.0

The bucket_expression parameter allows you to dynamically route data to different GCS buckets based on data item attributes using an OTTL expression. When specified, this expression is evaluated for each data item and the result is used as the destination bucket, overriding the static bucket parameter.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: default-bucket
    bucket_expression: attributes["target_bucket"]
    hmac_access_key: <access key>
    hmac_secret: <key secret>

keep_overridden_bucket

Minimum Agent Version: v2.8.0

The keep_overridden_bucket parameter specifies whether to retain the original bucket value in the data item after applying the bucket_expression. When set to true, the attribute used in the expression remains in the data. When set to false (default), the attribute is removed after being used for routing. It is specified as a Boolean and is optional.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: default-bucket
    bucket_expression: attributes["target_bucket"]
    keep_overridden_bucket: true
    hmac_access_key: <access key>
    hmac_secret: <key secret>

disable_compaction

YAML Only

This parameter configures whether to disable compaction by the Compactor Agent for data from this node before it is sent to the data destination. It is specified as a boolean, the default is false and it is optional.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    disable_compaction: true

compression

The compression parameter specifies the compression format. It can be gzip, zstd, snappy or uncompressed. It is specified as a string, has a default of gzip, and it is optional.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    compression: gzip | zstd | snappy | uncompressed

schema

The schema parameter specifies the schema format for archived data. It can be Archive or Raw. It is specified as a string, has a default of Archive, and is optional.

Archive - Uses the structured Edge Delta archive format (ArchiveLogPayload) for the archived data. This is the standard format for OTel-structured data, preserving fields such as body, attributes, resource, and timestamp in their original nested structure.
Raw - Uploads incoming items directly as a flat map[string]any to the archive destination. Use this only with deotelized data. When data has been processed by the Deotel processor, the OTel envelope is removed and fields are promoted to the top level — Raw preserves that flattened structure as-is. Raw does not extract or restructure fields; if the data still has OTel structure (with nested attributes, resource, or body), those fields will remain nested in the output.

nodes:
  - name: <node name>
    type: <output type>
    schema: Archive | Raw

user_schema (Schema Spec)

Minimum Agent Version: v2.8.0

The user_schema parameter (displayed as Schema Spec in the UI) provides an optional schema definition that tells the serializer what column names and types to expect in the output. You can specify your schema in two ways:

Inline JSON - Define the schema directly in YAML format
File path - Reference an external JSON schema file using file://<path> format (requires agent v2.10.0+)

Important: user_schema applies only when encoding is set to parquet. It is not used for JSON encoding. It is also not a field extraction or restructuring tool — it does not select, rename, or promote fields from the incoming data. It tells the Parquet serializer what column names and types to expect for data that has already been flattened. For this reason, user_schema is intended for use with schema: Raw and deotelized data — the Deotel processor removes the OTel envelope and promotes fields to the top level, and user_schema then describes those top-level fields to the serializer.

If your data is still OTel-structured (fields nested under attributes, resource, or body), user_schema will not flatten those fields into columns. You need to deotelized the data first.

Inline schema example (with deotelized data):

nodes:
  - name: <node name>
    type: <archive_output_type>
    schema: Raw
    encoding: parquet
    user_schema: |
      {
        "fields": [
          {"name": "timestamp", "type": "timestamp"},
          {"name": "message", "type": "string"},
          {"name": "level", "type": "string"},
          {"name": "service", "type": "string"}
        ]
      }

File path example (v2.10.0+):

nodes:
  - name: <node name>
    type: <archive_output_type>
    schema: Raw
    encoding: parquet
    user_schema: file:///etc/edgedelta/schemas/custom-schema.json

Use Cases:

Define specific column types for Parquet files to ensure compatibility with query engines like Athena or BigQuery
Ensure type consistency across pipelines when working with deotelized data
Share schema definitions across multiple pipelines using external schema files

encoding

The encoding parameter specifies the encoding format. It can be json or parquet. It is specified as a string, has a default of json, and it is optional.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    encoding: json | parquet

flush_interval

YAML Only

The flush_interval parameter specifies the duration to flush (or force) data to the destination, including buffered data. It is specified as a duration and is optional.

If not set on the node, the agent uses the following fallback chain:

settings.archive_flush_interval in the pipeline configuration
The ED_ARCHIVE_FLUSH_INTERVAL environment variable
30 minutes (hardcoded default)

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    flush_interval: 10m

max_byte_limit

YAML Only

The max_byte_limit parameter specifies the maximum bytes before flushing buffered raw data to archive destination. It is specified with a data size and is optional.

If not set on the node, the agent uses the following fallback chain:

settings.archive_max_byte_limit in the pipeline configuration
The ED_ARCHIVE_MAX_BYTE_LIMIT environment variable
16 MB (hardcoded default)

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    max_byte_limit: 32MB

path_prefix

The path_prefix parameter configures the path prefix using order and format child parameters. It is optional.

Note: The path template organizes files into folders but does not control upload cadence. For example, changing the path from /year/month/day/hour/minute/ to /year/month/day/hour/ does not make uploads happen hourly. The flush_interval setting controls when data is written to the destination.

The order child parameter lists the formatting items that will define the path prefix:

You can refer to Year, Month, Day, <any number that can divide 60> Minute, Hour, tag, host, OtherTags.<item related tags> and LogFields.<log related tags>.
For ECS, ecs_cluster, ecs_container_name, ecs_task_family and ecs_task_version are available.
For K8s, k8s_namespace, k8s_controller_kind, k8s_controller_logical_name, k8s_pod_name, k8s_container_name and k8s_container_image are available.
For Docker, docker_container_name and docker_image_name are available

The format child parameter specifies a format string that has %s as placeholders per each order item.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    path_prefix:
      order:
      - Year
      - Month
      - Day
      - Hour
      - 2 Minute
      - tag
      - host
    format: ver=parquet/year=%s/month=%s/day=%s/hour=%s/min=%s/tag=%s/host=%s/

use_native_compression

The use_native_compression parameter configures whether, for parquet encoding, to only compress data segments for each archive file, not the whole file. It is specified as a Boolean, has a default of false, and it is optional.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    use_native_compression: true

credentials_path

The credentials_path parameter specifies the path to a service account credentials JSON file. Use this when you need to authenticate with GCS using a service account instead of HMAC keys. It is specified as a string and is optional.

nodes:
  - name: <node name>
    type: gcs_output
    bucket: <target bucket>
    credentials_path: /etc/edgedelta/gcs-credentials.json

persistent_queue

The persistent_queue configuration enables disk-based buffering to prevent data loss during destination failures or slowdowns. When enabled, the agent stores data on disk and automatically retries delivery when the destination recovers.

Complete example:

persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  mode: error
  max_byte_size: 1GB
  drain_rate_limit: 1000

How it works:

Normal operation: Data flows directly to the destination (for error and backpressure modes) or through the disk buffer (for always mode)
Destination issue detected: Based on the configured mode, data is written to disk at the configured path
Recovery: When the destination recovers, buffered data drains at the configured drain_rate_limit while new data continues flowing
Completion: Buffer clears and normal operation resumes

Key benefits:

Data durability: Logs preserved during destination outages and slowdowns
Agent protection: Slow backends don’t cascade failures into the agent cluster
Automatic recovery: No manual intervention required
Configurable behavior: Choose when and how buffering occurs based on your needs

Learn more: Buffer Configuration - Conceptual overview, sizing guidance, and troubleshooting

path

The path parameter specifies the directory where buffered data is stored on disk. This parameter is required when configuring a persistent queue.

Example:

persistent_queue:
  path: /var/lib/edgedelta/outputbuffer

Requirements:

Required field - persistent queue will not function without a valid path
The directory must have sufficient disk space for the configured max_byte_size
The agent process must have read/write permissions to this location
The path should be on a persistent volume (not tmpfs or memory-backed filesystem)

Best practices:

Use dedicated storage for buffer data separate from logs
Monitor disk usage to prevent buffer from filling available space
Ensure the path persists across agent restarts to maintain buffered data

max_byte_size

The max_byte_size parameter defines the maximum disk space the persistent buffer is allowed to use. Once this limit is reached, any new incoming items are dropped, ensuring the buffer never grows beyond the configured maximum.

Note: This limit is total storage for the persistent queue, not per-worker. If you configure max_byte_size: 1GB and the destination has 15 workers, the buffer uses 1GB total, not 15GB.

Example:

persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  max_byte_size: 1GB

Sizing guidance:

Small deployments (1-10 logs/sec): 100MB - 500MB
Medium deployments (10-100 logs/sec): 500MB - 2GB
Large deployments (100+ logs/sec): 2GB - 10GB

Calculation example:

Average log size: 1KB
Expected outage duration: 1 hour
Log rate: 100 logs/sec

Buffer size = 1KB × 100 logs/sec × 3600 sec = 360MB
Recommended: 500MB - 1GB (with safety margin)

Important: Set this value based on your disk space availability and expected outage duration. The buffer will accumulate data during destination failures and drain when the destination recovers.

mode

The mode parameter determines when data is buffered to disk. Three modes are available:

error (default) - Buffers data only when the destination returns errors (connection failures, HTTP 5xx errors, timeouts). During healthy operation, data flows directly to the destination without buffering.
backpressure - Buffers data when the in-memory queue reaches 80% capacity OR when destination errors occur. This mode helps handle slow destinations that respond successfully but take longer than usual to process requests.
always - Uses write-ahead-log behavior where all data is written to disk before being sent to the destination. This provides maximum durability but adds disk I/O overhead to every operation.

Example:

persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  mode: error
  max_byte_size: 1GB

Mode comparison:

Mode	Protects Against	Trade-off	Recommended For
`error`	Destination outages and failures	No protection during slow responses	Reliable destinations with consistent response times
`backpressure`	Outages + slow/degraded destinations	Slightly more disk writes during slowdowns	Most production deployments
`always`	All scenarios including agent crashes	Disk I/O on every item reduces throughput	Maximum durability requirements

Why choose error mode:

The error mode provides the minimal protection layer needed to prevent data loss when destinations temporarily fail. Without any persistent queue, a destination outage means data is lost. With error mode enabled, data is preserved on disk during failures and delivered automatically when the destination recovers.

Why choose backpressure mode:

The backpressure mode provides everything error mode offers, plus protection against slow destinations. When a destination is slow but not completely down:

Without backpressure: Data delivery becomes unreliable, and the backend’s slowness propagates to the agent—the agent can get stuck waiting before sending subsequent payloads
With backpressure: The agent spills data to disk and continues processing, isolating itself from the slow backend

This prevents a slow destination from cascading failures into your agent cluster. For most production environments, backpressure provides the best balance of protection and performance.

Why choose always mode:

The always mode is designed for customers with extremely strict durability requirements. It forces the agent to write every item to disk before attempting delivery, then reads from disk for transmission. This guarantees that data survives even sudden agent crashes or restarts.

Important: This mode introduces a measurable performance cost. Each agent performs additional disk I/O on every item, which reduces overall throughput. Most deployments do not require this level of durability—this feature addresses specialized needs that apply to a small minority of customers.

Only enable always mode if you have a specific, well-understood requirement where the durability guarantee outweighs the throughput reduction.

strict_ordering

The strict_ordering parameter controls how items are consumed from the persistent buffer.

When strict_ordering: true, the agent runs in strict ordering mode with a single processing thread. This mode always prioritizes draining buffered items first—new incoming data waits until all buffered items are processed in exact chronological order. When strict_ordering: false (default), multiple workers process data in parallel, and new data flows directly to the destination while buffered data drains in the background.

Example:

persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  mode: always
  strict_ordering: true
parallel_workers: 1

Default value: false

Important: Strict ordering is a specialized feature needed by a very small minority of deployments. Most users should keep the default value of false. Only enable strict ordering if you have a specific, well-understood requirement for exact event sequencing.

Required setting: When strict_ordering: true, you must set parallel_workers: 1. Pipeline validation will fail if parallel_workers is greater than 1 because parallel processing inherently breaks ordering guarantees.

Behavior:

Value	Processing Model	Buffer Priority	Recovery Latency
`false` (default)	Parallel workers	Buffered data drains in background	Lower - current state visible immediately
`true`	Single-threaded	Buffered items always drain first	Higher - queue must drain before new data

Why the default is false:

In most observability use cases, data freshness is more valuable than strict ordering. When a destination recovers from an outage, operators typically want to see current system state on dashboards immediately, while historical data backfills in the background. The default behavior prioritizes this real-time visibility.

When to enable strict ordering:

Strict ordering is primarily needed by security-focused customers who build systems where events must arrive in the exact delivery order. These customers typically run stateful security streaming engines that depend on precise temporal sequencing.

Specific use cases:

Stateful security streaming engines - Security systems that maintain state across events and detect patterns based on exact event order
Audit and compliance logs - Regulatory requirements that mandate audit trails preserve exact temporal sequence
State reconstruction - Systems that replay events to rebuild state require chronological order

When to keep default (false):

The vast majority of deployments should keep the default:

Real-time monitoring dashboards - Current state visibility is more important than historical order
High-volume log ingestion - Faster drain times reduce recovery period
Stateless analytics - When each log is analyzed independently without temporal correlation

drain_rate_limit

The drain_rate_limit parameter controls the maximum items per second when draining the persistent buffer after a destination recovers from a failure.

Example:

persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  drain_rate_limit: 1000

Default value: 0 (no limit - drain as fast as the destination accepts)

Why rate limiting matters:

When a destination recovers from an outage, it may still be fragile. Immediately flooding it with hours of backlogged data can trigger another failure. The drain rate limit allows gradual, controlled recovery that protects destination stability.

Choosing the right rate:

Scenario	Recommended Rate	Reasoning
Stable, well-provisioned destination	`0` (unlimited)	Minimize recovery time when destination can handle full load
Shared or multi-tenant destination	20-50% of capacity	Leave headroom for live traffic and other tenants
Recently recovered destination	10-25% of capacity	Gentle ramp-up to prevent re-triggering failure
Rate-limited destination (e.g., SaaS)	Below API rate limit	Avoid throttling or quota exhaustion

Impact on recovery time:

Buffer size: 1GB
Average log size: 1KB
Total items: ~1,000,000 logs

At unlimited (0): Depends on destination capacity
At 5000:      ~3.5 minutes to drain
At 1000:      ~17 minutes to drain
At 100:       ~2.8 hours to drain

memory_buffer

YAML Only

This parameter requires Edge Delta agent version v2.12.0 or higher.

The memory_buffer parameter configures the in-memory buffer queue for the destination. The size sub-parameter sets the maximum number of events to hold in the memory buffer queue before blocking. It is specified as an integer, has a default of 1000, and it is optional.

nodes:
  - name: <destination-name>
    type: <destination-type>
    memory_buffer:
      size: 2000

Configuration Wizard

Use this interactive wizard to generate a starter configuration:

Google Cloud Storage Destination Configuration Wizard

Format Compression Config

Select encoding format

JSON

Standard JSON format for logs. Good compatibility with most tools.

Parquet

Columnar format optimized for analytics. Better compression and query performance.

Configuration Summary

Storage Settings

Encoding

Compression

Generated YAML Configuration

Replace these placeholders in the YAML:

MY_GCS_DESTINATION - Replace with your destination node name (e.g., gcs_logs)
GCS_BUCKET_NAME - Replace with your GCS bucket name (e.g., my-logs-bucket)
GCS_HMAC_ACCESS_KEY - Replace with your GCS HMAC access key (e.g., GOOG1E...)
GCS_HMAC_SECRET - Replace with your GCS HMAC secret. You can reference secrets using {{ secret "secret_name" }} syntax

Ensure your service account has the Storage Admin HMAC role. See Prepare GCS for setup instructions.

Your answers are private

Edge Delta Google Cloud Storage Output

Overview

Configuring GCS

Example Configuration

Required Parameters

name

type: gcs_output

bucket

hmac_access_key

hmac_secret

Optional Parameters

disabled

bucket_expression

keep_overridden_bucket

disable_compaction

compression

schema

user_schema (Schema Spec)

encoding

flush_interval

max_byte_limit

path_prefix

use_native_compression

credentials_path

persistent_queue

path

max_byte_size

mode

strict_ordering

drain_rate_limit

memory_buffer

Configuration Wizard

Google Cloud Storage Destination Configuration Wizard

Edge Delta AI Assistant

Conversations

Hi! I'm your Edge Delta AI Assistant

Current Context