Edge Delta Apache Kudu Destination
Overview
The Apache Kudu destination node sends data to Apache Kudu tables. Apache Kudu is a distributed columnar storage engine optimized for fast analytics on fast data. This node supports flexible schema mappings, batch configuration, and both insert and upsert write modes.
- incoming_data_types: metric, cluster_pattern_and_sample, log, custom
Note: This node is currently in beta and is available for Enterprise tier accounts.
Example Configuration
This configuration writes user data to a Kudu table named user_table. It uses upsert mode to update existing records or insert new ones based on the user_id key column. The schema maps incoming data attributes to table columns with appropriate data types, where user_id and created_at are required fields.
nodes:
  - name: my_apache_kudu
    type: apache_kudu_output
    hosts:
      - localhost:7051
    table_name: user_table
    mode: upsert
    schema_mappings:
      - column_name: user_id
        column_type: string
        expression: attributes["user_id"]
        is_key: true
        required: true
      - column_name: age
        column_type: int32
        expression: attributes["age"]
        required: false
      - column_name: created_at
        column_type: int64
        expression: attributes["created_at"]
        required: true
      - column_name: email
        column_type: string
        expression: attributes["email"]
        required: false
      - column_name: is_active
        column_type: bool
        expression: attributes["is_active"]
        default_value: "true"
Required Parameters
name
A descriptive name for the node. This is the name that will appear in pipeline builder, and you can reference this node in the YAML using the name. It must be unique across all nodes. It is a YAML list element, so it begins with a - and a space followed by the string. It is a required parameter for all nodes.
nodes:
  - name: <node name>
    type: <node type>
type: apache_kudu_output
The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.
nodes:
  - name: <node name>
    type: <node type>
hosts
The hosts parameter specifies the list of Apache Kudu master server addresses. It is specified as an array of strings in the format host:port and is required.
- name: <node name>
  type: apache_kudu_output
  hosts:
    - master1.example.com:7051
    - master2.example.com:7051
    - master3.example.com:7051
  table_name: <target table>
table_name
The table_name parameter defines the name of the Kudu table to write data to. It is specified as a string and is required.
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
schema_mappings
The schema_mappings parameter defines the list of column mappings for the Kudu table schema. Each mapping specifies how to extract data from incoming items and map it to a Kudu table column. This parameter is required.
Each schema mapping contains the following fields:
Field | Required | Description | Type | Options |
---|---|---|---|---|
column_name | Yes | Name of the column in the Kudu table | string | - |
column_type | Yes | Data type of the column | string | string, int32, int64, bool, float, double, binary |
expression | No | Expression to extract value from the data | string | - |
is_key | No | Whether this column is a key column | boolean | true, false |
required | No | Whether this column is required (non-null) | boolean | true, false (default: false) |
default_value | No | Default value for the column if no value is provided | string | - |
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  schema_mappings:
    - column_name: id
      column_type: string
      expression: attributes["id"]
      is_key: true
      required: true
    - column_name: timestamp
      column_type: int64
      expression: attributes["timestamp"]
      required: true
    - column_name: message
      column_type: string
      expression: body
    - column_name: severity
      column_type: string
      expression: attributes["severity"]
      default_value: "INFO"
Optional Parameters
mode
The mode parameter specifies the write mode for data insertion. It accepts two values:
- upsert: Updates existing rows or inserts new ones (default)
- insert: Only inserts new rows
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: upsert
  schema_mappings:
    # ... mappings ...
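For contrast with the upsert example, the following is a minimal sketch of an insert-only configuration, as might suit append-only data whose key values never repeat. The node name, table name, and column names are hypothetical. How the node reports a row whose key already exists in the table is not described here, so treat that behavior as something to verify in your environment.

```yaml
nodes:
  - name: kudu_append_only        # hypothetical node name
    type: apache_kudu_output
    hosts:
      - localhost:7051
    table_name: my_table
    mode: insert                  # only inserts new rows; existing rows are not updated
    schema_mappings:
      - column_name: id           # hypothetical key column, assumed unique per item
        column_type: string
        expression: attributes["id"]
        is_key: true
        required: true
      - column_name: message
        column_type: string
        expression: body
```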
batch_config
The batch_config parameter configures batching behavior for writing data to Kudu. It helps optimize performance by grouping multiple write operations.
Field | Description | Type | Default | Example |
---|---|---|---|---|
rows_limit | Maximum number of rows per batch | integer | 100 | 1000 |
row_size_limit | Maximum size limit per row | string | - | "1MB", "512KB" |
flush_interval | Time interval to flush batched data | string | - | "5s", "1m" |
flush_mode | Mode for flushing batched data | string | "auto" | "auto", "manual" |
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  batch_config:
    rows_limit: 500
    row_size_limit: "2MB"
    flush_interval: "10s"
    flush_mode: auto
  schema_mappings:
    # ... mappings ...
connection
The connection parameter configures connection management settings.
Field | Description | Type | Default | Example |
---|---|---|---|---|
timeout | Connection timeout | string | - | "30s", "1m" |
retry_attempts | Number of retry attempts for failed operations | integer | 3 | 5 |
retry_delay | Delay between retry attempts | string | - | "1s", "500ms" |
max_connections | Maximum number of concurrent connections | integer | 10 | 20 |
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  connection:
    timeout: "45s"
    retry_attempts: 5
    retry_delay: "2s"
    max_connections: 15
  schema_mappings:
    # ... mappings ...
parallel_worker_count
The parallel_worker_count parameter specifies the number of workers that run in parallel to process and send data to Kudu. Increasing this value can improve throughput for high-volume data streams.
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  parallel_worker_count: 10
  schema_mappings:
    # ... mappings ...
Default: 5
tls
The tls parameter is a dictionary that configures TLS settings for secure connections to the destination. It is optional and typically used when connecting to endpoints that require encrypted transport (HTTPS) or mutual TLS.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      <tls options>
enabled
Specifies whether TLS is enabled. This is a Boolean value. Default is false.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      enabled: true
ignore_certificate_check
Disables certificate verification. Useful for test environments. Default is false.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ignore_certificate_check: true
ca_file
Specifies the absolute path to a CA certificate file for verifying the remote server’s certificate.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_file: /certs/ca.pem
ca_path
Specifies a directory containing one or more CA certificate files.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_path: /certs/
crt_file
Path to the client certificate file for mutual TLS authentication.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
key_file
Path to the private key file used for client TLS authentication.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      key_file: /certs/client-key.pem
key_password
Password for the TLS private key file, if required.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      key_password: <password>
client_auth_type
Controls how client certificates are requested and validated during the TLS handshake. Valid options:
noclientcert
requestclientcert
requireanyclientcert
verifyclientcertifgiven
requireandverifyclientcert
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      client_auth_type: requireandverifyclientcert
max_version
Maximum supported version of the TLS protocol.
TLSv1_0
TLSv1_1
TLSv1_2
TLSv1_3
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      max_version: TLSv1_3
min_version
Minimum supported version of the TLS protocol. Default is TLSv1_2.
nodes:
  - name: <node name>
    type: <destination type>
    tls:
      min_version: TLSv1_2
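The individual tls options above can be combined in a single node. The following is a minimal sketch of an Apache Kudu destination using mutual TLS. The node name, host, certificate paths, and the single column mapping are hypothetical placeholders, and it assumes the Kudu cluster is set up to accept the client certificate shown.

```yaml
nodes:
  - name: kudu_mtls                      # hypothetical node name
    type: apache_kudu_output
    hosts:
      - master1.example.com:7051
    table_name: my_table
    tls:
      enabled: true
      ca_file: /certs/ca.pem             # CA used to verify the server certificate
      crt_file: /certs/client-cert.pem   # client certificate for mutual TLS
      key_file: /certs/client-key.pem    # private key matching the client certificate
      min_version: TLSv1_2
    schema_mappings:
      - column_name: id                  # hypothetical key column
        column_type: string
        expression: attributes["id"]
        is_key: true
        required: true
```

If the private key is password protected, key_password would be added alongside key_file.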
Performance Considerations
When configuring the Apache Kudu destination, consider the following for optimal performance:
- Batch Size: Adjust rows_limit in batch_config based on your data volume and latency requirements. Larger batches improve throughput but increase latency.
- Write Mode: Use upsert mode when you need to handle duplicate keys, but be aware it has slightly higher overhead than insert mode.
- Connection Pool: Set max_connections based on your Kudu cluster capacity and expected throughput.
- Schema Design: Define key columns (is_key: true) carefully as they determine the primary key and affect write performance.
- Flush Interval: Balance between data freshness and write efficiency with the flush_interval setting.
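As an illustration of how these settings fit together, here is a sketch of a throughput-oriented configuration. Every value is an illustrative starting point rather than a recommendation, and the node name and column mapping are hypothetical. Appropriate numbers depend on your data volume and Kudu cluster capacity.

```yaml
nodes:
  - name: kudu_high_volume       # hypothetical node name
    type: apache_kudu_output
    hosts:
      - localhost:7051
    table_name: my_table
    mode: upsert                 # tolerates duplicate keys at slightly higher overhead than insert
    parallel_worker_count: 10    # default is 5
    batch_config:
      rows_limit: 1000           # larger batches: more throughput, more latency (default 100)
      flush_interval: "5s"       # shorter interval favors data freshness
      flush_mode: auto
    connection:
      max_connections: 20        # default is 10; size against cluster capacity
    schema_mappings:
      - column_name: id          # hypothetical key column; forms the primary key
        column_type: string
        expression: attributes["id"]
        is_key: true
        required: true
```

For a latency-sensitive pipeline, the same layout applies with a smaller rows_limit and a shorter flush_interval.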
Troubleshooting
For comprehensive troubleshooting of Apache Kudu destination issues including connection problems, schema mismatches, performance optimization, and debugging techniques, see the Apache Kudu Troubleshooting Guide.