Edge Delta Apache Kudu Destination
Overview
The Apache Kudu destination node sends data to Apache Kudu tables. Apache Kudu is a distributed columnar storage engine optimized for fast analytics on fast data. This node supports flexible schema mappings, batch configuration, and both insert and upsert write modes.
- incoming_data_types: metric, cluster_pattern_and_sample, log, custom
Note: This node is currently in beta and is available for Enterprise tier accounts.
This node requires Edge Delta agent version v2.7.0 or higher.
Important: The Apache Kudu destination requires an Extended Pipeline Type (Edge-extended or Cloud-extended). When creating your pipeline in the Edge Delta UI, ensure you select the extended pipeline type option to deploy the specialized agent binary that includes Kudu support. Standard agent binaries do not include the Apache Kudu client libraries required for this destination.
Example Configuration
This configuration writes user data to a Kudu table named user_table. It uses upsert mode to update existing records or insert new ones based on the user_id key column. The schema maps incoming data attributes to table columns with appropriate data types, where user_id and created_at are required fields.
nodes:
  - name: my_apache_kudu
    type: apache_kudu_output
    hosts:
      - localhost:7051
    table_name: user_table
    mode: upsert
    schema_mappings:
      - column_name: user_id
        column_type: string
        expression: attributes["user_id"]
        is_key: true
        required: true
      - column_name: age
        column_type: int32
        expression: attributes["age"]
        required: false
      - column_name: created_at
        column_type: int64
        expression: attributes["created_at"]
        required: true
      - column_name: email
        column_type: string
        expression: attributes["email"]
        required: false
      - column_name: is_active
        column_type: bool
        expression: attributes["is_active"]
        default_value: "true"
Required Parameters
name
A descriptive name for the node. This name appears in the pipeline builder, and you can use it to reference the node elsewhere in the YAML. It must be unique across all nodes. Because it is a YAML list element, it begins with a - and a space, followed by the string. It is a required parameter for all nodes.
nodes:
  - name: <node name>
    type: <node type>
type: apache_kudu_output
The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.
nodes:
  - name: <node name>
    type: <node type>
hosts
The hosts parameter specifies the list of Apache Kudu master server addresses. It is specified as an array of strings in the format host:port and is required.
- name: <node name>
  type: apache_kudu_output
  hosts:
    - master1.example.com:7051
    - master2.example.com:7051
    - master3.example.com:7051
  table_name: <target table>
table_name
The table_name parameter defines the name of the Kudu table to write data to. It is specified as a string and is required.
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
schema_mappings
The schema_mappings parameter defines the list of column mappings for the Kudu table schema. Each mapping specifies how to extract data from incoming items and map them to Kudu table columns. This parameter is required.
Each schema mapping contains the following fields:
| Field | Required | Description | Type | Options |
|---|---|---|---|---|
| column_name | Yes | Name of the column in the Kudu table | string | - |
| column_type | Yes | Data type of the column | string | string, int32, int64, bool, float, double, binary |
| expression | No | Expression to extract the value from the data | string | - |
| is_key | No | Whether this column is a key column | boolean | true, false |
| required | No | Whether this column is required (non-null) | boolean | true, false (default: false) |
| default_value | No | Default value for the column if no value is provided | string | - |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  schema_mappings:
    - column_name: id
      column_type: string
      expression: attributes["id"]
      is_key: true
      required: true
    - column_name: timestamp
      column_type: int64
      expression: attributes["timestamp"]
      required: true
    - column_name: message
      column_type: string
      expression: body
    - column_name: severity
      column_type: string
      expression: attributes["severity"]
      default_value: "INFO"
Optional Parameters
mode
The mode parameter specifies the write mode for data insertion. It accepts two values:
- upsert: Updates existing rows or inserts new ones (default)
- insert: Only inserts new rows
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: upsert
  schema_mappings:
    # ... mappings ...
batch_config
The batch_config parameter configures batching behavior for writing data to Kudu. It helps optimize performance by grouping multiple write operations.
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| rows_limit | Maximum number of rows per batch | integer | 100 | 1000 |
| row_size_limit | Maximum size limit per row | string | - | "1MB", "512KB" |
| flush_interval | Time interval to flush batched data | string | - | "5s", "1m" |
| flush_mode | Mode for flushing batched data | string | "auto" | "auto", "manual" |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  batch_config:
    rows_limit: 500
    row_size_limit: "2MB"
    flush_interval: "10s"
    flush_mode: auto
  schema_mappings:
    # ... mappings ...
connection
The connection parameter configures connection management settings.
| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| timeout | Connection timeout | string | - | "30s", "1m" |
| retry_attempts | Number of retry attempts for failed operations | integer | 3 | 5 |
| retry_delay | Delay between retry attempts | string | - | "1s", "500ms" |
| max_connections | Maximum number of concurrent connections | integer | 10 | 20 |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  connection:
    timeout: "45s"
    retry_attempts: 5
    retry_delay: "2s"
    max_connections: 15
  schema_mappings:
    # ... mappings ...
parallel_worker_count
The parallel_worker_count parameter specifies the number of workers that run in parallel to process and send data to Kudu. Increasing this value can improve throughput for high-volume data streams.
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  parallel_worker_count: 10
  schema_mappings:
    # ... mappings ...
Default: 5
tls
Configure TLS settings for secure connections to this destination. TLS is optional and typically used when connecting to endpoints that require encrypted transport (HTTPS) or mutual TLS.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      <tls options>
Enable TLS
Enables TLS encryption for outbound connections to the destination endpoint. When enabled, all communication with the destination will be encrypted using TLS/SSL. This should be enabled when connecting to HTTPS endpoints or any service that requires encrypted transport. (YAML parameter: enabled)
Default: false
When to use: Enable when the destination requires HTTPS or secure connections. Always enable for production systems handling sensitive data, connections over untrusted networks, or when compliance requirements mandate encryption in transit.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      enabled: true
Ignore Certificate Check
Disables TLS certificate verification, allowing connections to servers with self-signed, expired, or invalid certificates. This bypasses security checks that verify the server’s identity and certificate validity. (YAML parameter: ignore_certificate_check)
Default: false
When to use: Only use in development or testing environments with self-signed certificates. NEVER enable in production—this makes your connection vulnerable to man-in-the-middle attacks. For production with self-signed certificates, use ca_file or ca_path to explicitly trust specific certificates instead.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ignore_certificate_check: true # Only for testing!
CA Certificate File
Specifies the absolute path to a CA (Certificate Authority) certificate file used to verify the destination server’s certificate. This allows you to trust specific CAs beyond the system’s default trusted CAs, which is essential when connecting to servers using self-signed certificates or private CAs. (YAML parameter: ca_file)
When to use: Required when connecting to servers with certificates signed by a private/internal CA, or when you want to restrict trust to specific CAs only. Choose either ca_file (single CA certificate) or ca_path (directory of CA certificates), not both.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_file: /certs/ca.pem
CA Certificate Path
Specifies a directory path containing one or more CA certificate files for verifying the destination server’s certificate. Use this when you need to trust multiple CAs or when managing CA certificates across multiple files. All certificate files in the directory will be loaded. (YAML parameter: ca_path)
When to use: Alternative to ca_file when you have multiple CA certificates to trust. Useful for environments with multiple private CAs or when you need to rotate CA certificates without modifying configuration. Choose either ca_file or ca_path, not both.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_path: /certs/ca-certificates/
Certificate File
Path to the client certificate file (public key) used for mutual TLS (mTLS) authentication with the destination server. This certificate identifies the client to the server and must match the private key. The certificate should be in PEM format. (YAML parameter: crt_file)
When to use: Required only when the destination server requires mutual TLS authentication, where both client and server present certificates. Must be used together with key_file. Not needed for standard client TLS connections where only the server presents a certificate.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
      key_file: /certs/client-key.pem
Private Key File
Path to the private key file corresponding to the client certificate. This key must match the public key in the certificate file and is used during the TLS handshake to prove ownership of the certificate. Keep this file secure with restricted permissions. (YAML parameter: key_file)
When to use: Required for mutual TLS authentication. Must be used together with crt_file. If the key file is encrypted with a password, also specify key_password. Only needed when the destination server requires client certificate authentication.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
      key_file: /certs/client-key.pem
      key_password: <password> # Only if key is encrypted
Private Key Password
Password (passphrase) used to decrypt an encrypted private key file. Only needed if your private key file is password-protected. If your key file is unencrypted, omit this parameter. (YAML parameter: key_password)
When to use: Optional. Only required if key_file is encrypted/password-protected. For enhanced security, use encrypted keys in production environments. If you receive “bad decrypt” or “incorrect password” errors, verify the password matches the key file encryption.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
      key_file: /certs/encrypted-client-key.pem
      key_password: mySecurePassword123
Minimum TLS Version
Minimum TLS protocol version to use when connecting to the destination server. This enforces a baseline security level by refusing to connect if the server doesn’t support this version or higher. (YAML parameter: min_version)
Available versions:
- TLSv1_0: Deprecated, not recommended (security vulnerabilities)
- TLSv1_1: Deprecated, not recommended (security vulnerabilities)
- TLSv1_2: Recommended minimum for production (default)
- TLSv1_3: Most secure, use when destination supports it
Default: TLSv1_2
When to use: Set to TLSv1_2 or higher for production deployments. Only use TLSv1_0 or TLSv1_1 if connecting to legacy servers that don’t support newer versions, and be aware of the security risks. TLS 1.0 and 1.1 are officially deprecated.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      min_version: TLSv1_2
Maximum TLS Version
Maximum TLS protocol version to use when connecting to the destination server. This is typically used to restrict newer TLS versions if compatibility issues arise with specific server implementations. (YAML parameter: max_version)
Available versions:
- TLSv1_0
- TLSv1_1
- TLSv1_2
- TLSv1_3
When to use: Usually left unset to allow the most secure version available. Only set this if you encounter specific compatibility issues with TLS 1.3 on the destination server, or for testing purposes. In most cases, you should allow the latest TLS version.
YAML Configuration Example:
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      max_version: TLSv1_3
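The individual TLS options above can be combined in a single tls block. The following is a minimal sketch, not a verified configuration: it assumes a Kudu cluster whose certificates are signed by a private CA and that requires client certificates (mutual TLS). The node name kudu_secure, the host name, and the file paths are hypothetical placeholders; only the parameter names come from this page.

```yaml
nodes:
  - name: kudu_secure                  # hypothetical node name
    type: apache_kudu_output
    hosts:
      - master1.example.com:7051       # placeholder master address
    table_name: my_table
    tls:
      enabled: true                    # encrypt outbound connections
      ca_file: /certs/ca.pem           # trust the private CA (use ca_file or ca_path, not both)
      crt_file: /certs/client-cert.pem # client certificate for mutual TLS
      key_file: /certs/client-key.pem  # must match crt_file
      min_version: TLSv1_2             # the documented default, stated explicitly
    schema_mappings:
      # ... mappings ...
```

Here ignore_certificate_check is left at its default of false, so the server certificate is still verified against the CA in ca_file. max_version is left unset so that the most secure version available can be negotiated.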
Performance Considerations
When configuring the Apache Kudu destination, consider the following for optimal performance:
- Batch Size: Adjust rows_limit in batch_config based on your data volume and latency requirements. Larger batches improve throughput but increase latency.
- Write Mode: Use upsert mode when you need to handle duplicate keys, but be aware it has slightly higher overhead than insert mode.
- Connection Pool: Set max_connections based on your Kudu cluster capacity and expected throughput.
- Schema Design: Define key columns (is_key: true) carefully as they determine the primary key and affect write performance.
- Flush Interval: Balance between data freshness and write efficiency with the flush_interval setting.
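As an illustration of how these settings fit together, the sketch below shows one possible throughput-oriented configuration. The values are illustrative assumptions taken from the example columns of the tables above, not tuned recommendations, and the node name kudu_high_volume is hypothetical. Measure against your own cluster before adopting any of them.

```yaml
nodes:
  - name: kudu_high_volume        # hypothetical node name
    type: apache_kudu_output
    hosts:
      - localhost:7051
    table_name: my_table
    mode: insert                  # assumes incoming keys are unique; otherwise keep upsert
    parallel_worker_count: 10     # above the default of 5
    batch_config:
      rows_limit: 1000            # larger batches: more throughput, more latency
      flush_interval: "5s"        # upper bound on how long rows wait in a batch
    connection:
      max_connections: 20         # size to what the Kudu cluster can absorb
    schema_mappings:
      # ... mappings ...
```

For a latency-sensitive stream, the same fields would move in the opposite direction: a smaller rows_limit and a shorter flush_interval, at the cost of more frequent writes.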
Troubleshooting
For comprehensive troubleshooting of Apache Kudu destination issues including connection problems, schema mismatches, performance optimization, and debugging techniques, see the Apache Kudu Troubleshooting Guide.