Edge Delta Apache Kudu Destination
7 minute read
Overview
The Apache Kudu destination node sends data to Apache Kudu tables. Apache Kudu is a distributed columnar storage engine optimized for fast analytics on fast data. This node supports flexible schema mappings, batch configuration, and both insert and upsert write modes.
- incoming_data_types: metric, cluster_pattern_and_sample, log, custom
Note: This node is currently in beta and is available for Enterprise tier accounts.
This node requires Edge Delta agent version v2.7.0 or higher. Kerberos authentication and encryption support requires version v2.8.0 or higher.
Important: Apache Kudu clusters typically require Kerberos authentication for security. You must configure the
kudu_securityblock with Kerberos credentials (principal, keytab, and realm) to connect to production Kudu clusters. See kudu_security for configuration details.
Example Configuration

This configuration writes user data to a Kudu table named user_table. It uses upsert mode to update existing records or insert new ones based on the user_id key column. The schema maps incoming data attributes to table columns with appropriate data types, where user_id and created_at are required fields.
nodes:
- name: my_apache_kudu
type: apache_kudu_output
hosts:
- localhost:7051
table_name: user_table
mode: upsert
schema_mappings:
- column_name: user_id
column_type: string
expression: attributes["user_id"]
is_key: true
required: true
- column_name: age
column_type: int32
expression: attributes["age"]
required: false
- column_name: created_at
column_type: int64
expression: attributes["created_at"]
required: true
- column_name: email
column_type: string
expression: attributes["email"]
required: false
- column_name: is_active
column_type: bool
expression: attributes["is_active"]
default_value: "true"
Required Parameters
name
A descriptive name for the node. This is the name that will appear in pipeline builder and you can reference this node in the YAML using the name. It must be unique across all nodes. It is a YAML list element so it begins with a - and a space followed by the string. It is a required parameter for all nodes.
nodes:
- name: <node name>
type: <node type>
type: apache_kudu_output
The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.
nodes:
- name: <node name>
type: <node type>
hosts
The hosts parameter specifies the list of Apache Kudu master server addresses. It is specified as an array of strings in the format host:port and is required.
- name: <node name>
type: apache_kudu_output
hosts:
- master1.example.com:7051
- master2.example.com:7051
- master3.example.com:7051
table_name: <target table>
table_name
The table_name parameter defines the name of the Kudu table to write data to. It is specified as a string and is required.
- name: <node name>
type: apache_kudu_output
hosts:
- localhost:7051
table_name: my_table
schema_mappings
The schema_mappings parameter defines the list of column mappings for the Kudu table schema. Each mapping specifies how to extract data from incoming items and map them to Kudu table columns. This parameter is required.
Each schema mapping contains the following fields:
| Field | Required | Description | Type | Options |
|---|---|---|---|---|
column_name | Yes | Name of the column in the Kudu table | string | - |
column_type | Yes | Data type of the column | string | string, int32, int64, bool, float, double, binary |
expression | No | Expression to extract value from the data | string | - |
is_key | No | Whether this column is a key column | boolean | true, false |
required | No | Whether this column is required (non-null) | boolean | true, false (default: false) |
default_value | No | Default value for the column if no value is provided | string | - |
- name: <node name>
type: apache_kudu_output
hosts:
- localhost:7051
table_name: my_table
schema_mappings:
- column_name: id
column_type: string
expression: attributes["id"]
is_key: true
required: true
- column_name: timestamp
column_type: int64
expression: attributes["timestamp"]
required: true
- column_name: message
column_type: string
expression: body
- column_name: severity
column_type: string
expression: attributes["severity"]
default_value: "INFO"
Optional Parameters
mode
The mode parameter specifies the write mode for data insertion. It accepts two values:
upsert: Updates existing rows or inserts new ones (default)insert: Only inserts new rows
- name: <node name>
type: apache_kudu_output
hosts:
- localhost:7051
table_name: my_table
mode: upsert
schema_mappings:
# ... mappings ...
kudu_batch_config
The kudu_batch_config parameter configures batching behavior for writing data to Kudu. It helps optimize performance by grouping multiple write operations.
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
rows_limit | Maximum number of rows per batch | integer | 100 | 1000 |
row_size_limit | Maximum size limit per row | string | - | 1MB, 512KB |
flush_interval | Time interval to flush batched data | string | - | 5s, 1m |
flush_mode | Mode for flushing batched data | string | auto | auto, manual |
- name: <node name>
type: apache_kudu_output
hosts:
- localhost:7051
table_name: my_table
kudu_batch_config:
rows_limit: 500
row_size_limit: 2MB
flush_interval: 10s
flush_mode: auto
schema_mappings:
# ... mappings ...
kudu_connection
The kudu_connection parameter configures connection management settings.
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
timeout | Connection timeout | string | - | 30s, 1m |
retry_attempts | Number of retry attempts for failed operations | integer | 3 | 5 |
retry_delay | Delay between retry attempts | string | - | 1s, 500ms |
max_connections | Maximum number of concurrent connections | integer | 10 | 20 |
- name: <node name>
type: apache_kudu_output
hosts:
- localhost:7051
table_name: my_table
kudu_connection:
timeout: 45s
retry_attempts: 5
retry_delay: 2s
max_connections: 15
schema_mappings:
# ... mappings ...
parallel_worker_count
The parallel_worker_count parameter specifies the number of workers that run in parallel to process and send data to Kudu. Increasing this value can improve throughput for high-volume data streams.
- name: <node name>
type: apache_kudu_output
hosts:
- localhost:7051
table_name: my_table
parallel_worker_count: 10
schema_mappings:
# ... mappings ...
Default: 5
kudu_security
The kudu_security parameter configures authentication and encryption for connecting to secured Apache Kudu clusters. This enables integration with enterprise Kudu deployments that require Kerberos authentication. For detailed setup instructions including keytab management and troubleshooting, see Kerberos Authentication.
auth_type
The auth_type field specifies the authentication mechanism. Available options:
none- No authentication (default)kerberos- Kerberos (GSSAPI) authentication
kerberos
When auth_type is set to kerberos, configure the kerberos block with the following options:
| Parameter | Required | Description |
|---|---|---|
principal | Yes | Kerberos principal name for the Edge Delta agent (e.g., edgedelta-agent@EXAMPLE.COM) |
keytab | Yes | Absolute path to the keytab file for Kerberos authentication |
realm | No | Kerberos realm (extracted from principal if not specified) |
sasl_protocol_name | No | SASL protocol name (defaults to kudu) |
krb5_conf_path | No | Path to krb5.conf file (uses system default if not specified) |
tls
The tls block within kudu_security configures TLS encryption for the Kudu connection. TLS is enabled when ca_file is specified.
| Parameter | Description |
|---|---|
ca_file | Path to the CA certificate file for server verification (enables TLS) |
cert_file | Path to the client certificate file (for mutual TLS) |
key_file | Path to the client private key file (for mutual TLS) |
skip_verify | Skip server certificate verification (not recommended for production) |
Example: Kerberos with TLS
This example shows a complete Kudu destination configuration with Kerberos authentication and TLS encryption. This is the typical configuration for production Kudu clusters:
nodes:
- name: secure_kudu
type: apache_kudu_output
hosts:
- kudu-master1.example.com:7051
- kudu-master2.example.com:7051
table_name: secure_table
mode: upsert
kudu_security:
auth_type: kerberos
kerberos:
principal: edgedelta-agent@EXAMPLE.COM
keytab: /etc/security/keytabs/edgedelta.keytab
realm: EXAMPLE.COM
sasl_protocol_name: kudu
krb5_conf_path: /etc/krb5.conf
tls:
ca_file: /etc/ssl/certs/kudu-ca.crt
kudu_batch_config:
rows_limit: 100
flush_interval: 5s
flush_mode: auto
kudu_connection:
timeout: 30s
retry_attempts: 3
retry_delay: 1s
max_connections: 10
schema_mappings:
- column_name: timestamp
column_type: int64
expression: timestamp
is_key: true
required: true
- column_name: message
column_type: string
expression: body
- column_name: id
column_type: string
expression: attributes["id"]
is_key: true
required: true
Note: When Kerberos is enabled on the Kudu cluster (via
--rpc_authentication=requiredon master and tablet servers), the agent must provide valid Kerberos credentials. Connections without proper authentication will be rejected.
Performance Considerations
When configuring the Apache Kudu destination, consider the following for optimal performance:
- Batch Size: Adjust
rows_limitinkudu_batch_configbased on your data volume and latency requirements. Larger batches improve throughput but increase latency. - Write Mode: Use
upsertmode when you need to handle duplicate keys, but be aware it has slightly higher overhead thaninsertmode. - Connection Pool: Set
max_connectionsbased on your Kudu cluster capacity and expected throughput. - Schema Design: Define key columns (
is_key: true) carefully as they determine the primary key and affect write performance. - Flush Interval: Balance between data freshness and write efficiency with the
flush_intervalsetting.
Troubleshooting
For comprehensive troubleshooting of Apache Kudu destination issues including connection problems, schema mismatches, performance optimization, and debugging techniques, see the Apache Kudu Troubleshooting Guide.