Edge Delta Apache Kudu Destination

Configure the Apache Kudu destination node to send data to Apache Kudu tables with schema mappings and batch configuration.

Overview

The Apache Kudu destination node sends data to Apache Kudu tables. Apache Kudu is a distributed columnar storage engine optimized for fast analytics on fast data. This node supports flexible schema mappings, batch configuration, and both insert and upsert write modes.

Note: This node is currently in beta and is available for Enterprise tier accounts.

This node requires Edge Delta agent version v2.7.0 or higher. Kerberos authentication and encryption support requires version v2.8.0 or higher.

Important: Apache Kudu clusters typically require Kerberos authentication for security. To connect to production Kudu clusters, you must configure the kudu_security block with Kerberos credentials: a principal and a keytab, plus an optional realm that is extracted from the principal if omitted. See kudu_security for configuration details.

Example Configuration


This configuration writes user data to a Kudu table named user_table. It uses upsert mode to update existing records or insert new ones based on the user_id key column. The schema maps incoming data attributes to table columns with appropriate data types, where user_id and created_at are required fields.

nodes:
- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: user_table
  mode: upsert
  schema_mappings:
  - column_name: user_id
    column_type: string
    expression: attributes["user_id"]
    is_key: true
    required: true
  - column_name: age
    column_type: int32
    expression: attributes["age"]
    required: false
  - column_name: created_at
    column_type: int64
    expression: attributes["created_at"]
    required: true
  - column_name: email
    column_type: string
    expression: attributes["email"]
    required: false
  - column_name: is_active
    column_type: bool
    expression: attributes["is_active"]
    default_value: "true"

Required Parameters

name

A descriptive name for the node. This name appears in the pipeline builder, and you use it to reference the node elsewhere in the YAML. It must be unique across all nodes. Because it is a YAML list element, it begins with a - and a space, followed by the string. It is required for all nodes.

nodes:
  - name: <node name>
    type: <node type>

type: apache_kudu_output

The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.

nodes:
  - name: <node name>
    type: <node type>

hosts

The hosts parameter specifies the list of Apache Kudu master server addresses. It is specified as an array of strings in the format host:port and is required.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - master1.example.com:7051
    - master2.example.com:7051
    - master3.example.com:7051
  table_name: <target table>

table_name

The table_name parameter defines the name of the Kudu table to write data to. It is specified as a string and is required.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table

schema_mappings

The schema_mappings parameter defines the list of column mappings for the Kudu table schema. Each mapping specifies how to extract a value from an incoming item and which Kudu table column to write it to. This parameter is required.

Each schema mapping contains the following fields:

| Field | Required | Description | Type | Options |
|---|---|---|---|---|
| column_name | Yes | Name of the column in the Kudu table | string | - |
| column_type | Yes | Data type of the column | string | string, int32, int64, bool, float, double, binary |
| expression | No | Expression to extract value from the data | string | - |
| is_key | No | Whether this column is a key column | boolean | true, false |
| required | No | Whether this column is required (non-null) | boolean | true, false (default: false) |
| default_value | No | Default value for the column if no value is provided | string | - |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  schema_mappings:
  - column_name: id
    column_type: string
    expression: attributes["id"]
    is_key: true
    required: true
  - column_name: timestamp
    column_type: int64
    expression: attributes["timestamp"]
    required: true
  - column_name: message
    column_type: string
    expression: body
  - column_name: severity
    column_type: string
    expression: attributes["severity"]
    default_value: "INFO"

Optional Parameters

mode

The mode parameter specifies the write mode for data insertion. It accepts two values:

  • upsert: Updates existing rows or inserts new ones (default)
  • insert: Only inserts new rows

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: upsert
  schema_mappings:
    # ... mappings ...
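
For comparison, here is a minimal sketch of the same node in insert mode. The node name, table name, and column names are hypothetical, and this page does not describe how the node reports rows whose key already exists, so treat it as an illustration of the setting rather than a recommended configuration.

```yaml
nodes:
- name: kudu_insert_only          # hypothetical node name
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: events_table        # hypothetical table
  mode: insert                    # only inserts new rows; existing rows are not updated
  schema_mappings:
  - column_name: event_id         # hypothetical key column
    column_type: string
    expression: attributes["event_id"]
    is_key: true
    required: true
  - column_name: message
    column_type: string
    expression: body
```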

kudu_batch_config

The kudu_batch_config parameter configures batching behavior for writing data to Kudu. It helps optimize performance by grouping multiple write operations.

| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| rows_limit | Maximum number of rows per batch | integer | 100 | 1000 |
| row_size_limit | Maximum size limit per row | string | - | 1MB, 512KB |
| flush_interval | Time interval to flush batched data | string | - | 5s, 1m |
| flush_mode | Mode for flushing batched data | string | auto | auto, manual |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  kudu_batch_config:
    rows_limit: 500
    row_size_limit: 2MB
    flush_interval: 10s
    flush_mode: auto
  schema_mappings:
    # ... mappings ...

kudu_connection

The kudu_connection parameter configures connection management settings.

| Field | Description | Type | Default | Example |
|---|---|---|---|---|
| timeout | Connection timeout | string | - | 30s, 1m |
| retry_attempts | Number of retry attempts for failed operations | integer | 3 | 5 |
| retry_delay | Delay between retry attempts | string | - | 1s, 500ms |
| max_connections | Maximum number of concurrent connections | integer | 10 | 20 |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  kudu_connection:
    timeout: 45s
    retry_attempts: 5
    retry_delay: 2s
    max_connections: 15
  schema_mappings:
    # ... mappings ...

parallel_worker_count

The parallel_worker_count parameter specifies the number of workers that run in parallel to process and send data to Kudu. Increasing this value can improve throughput for high-volume data streams.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  parallel_worker_count: 10
  schema_mappings:
    # ... mappings ...

Default: 5

kudu_security

The kudu_security parameter configures authentication and encryption for connecting to secured Apache Kudu clusters. This enables integration with enterprise Kudu deployments that require Kerberos authentication. For detailed setup instructions including keytab management and troubleshooting, see Kerberos Authentication.

auth_type

The auth_type field specifies the authentication mechanism. Available options:

  • none - No authentication (default)
  • kerberos - Kerberos (GSSAPI) authentication

kerberos

When auth_type is set to kerberos, configure the kerberos block with the following options:

| Parameter | Required | Description |
|---|---|---|
| principal | Yes | Kerberos principal name for the Edge Delta agent (e.g., edgedelta-agent@EXAMPLE.COM) |
| keytab | Yes | Absolute path to the keytab file for Kerberos authentication |
| realm | No | Kerberos realm (extracted from principal if not specified) |
| sasl_protocol_name | No | SASL protocol name (defaults to kudu) |
| krb5_conf_path | No | Path to krb5.conf file (uses system default if not specified) |
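
A minimal sketch of a Kerberos configuration that sets only the two required options and relies on the documented defaults for the rest. The host, principal, and keytab path are placeholder values to replace with your own.

```yaml
- name: <node name>
  type: apache_kudu_output
  hosts:
    - kudu-master1.example.com:7051
  table_name: my_table
  kudu_security:
    auth_type: kerberos
    kerberos:
      principal: edgedelta-agent@EXAMPLE.COM          # required
      keytab: /etc/security/keytabs/edgedelta.keytab  # required, absolute path
      # realm, sasl_protocol_name, and krb5_conf_path are omitted here,
      # so the documented defaults apply (realm taken from the principal,
      # protocol name kudu, system krb5.conf).
  schema_mappings:
    # ... mappings ...
```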

tls

The tls block within kudu_security configures TLS encryption for the Kudu connection. TLS is enabled when ca_file is specified.

| Parameter | Description |
|---|---|
| ca_file | Path to the CA certificate file for server verification (enables TLS) |
| cert_file | Path to the client certificate file (for mutual TLS) |
| key_file | Path to the client private key file (for mutual TLS) |
| skip_verify | Skip server certificate verification (not recommended for production) |
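
A sketch of how the tls block might look for mutual TLS, assuming the cluster requires a client certificate in addition to Kerberos. Only the options listed above are used; the client certificate and key paths are hypothetical.

```yaml
- name: <node name>
  type: apache_kudu_output
  hosts:
    - kudu-master1.example.com:7051
  table_name: my_table
  kudu_security:
    auth_type: kerberos
    kerberos:
      principal: edgedelta-agent@EXAMPLE.COM
      keytab: /etc/security/keytabs/edgedelta.keytab
    tls:
      ca_file: /etc/ssl/certs/kudu-ca.crt               # enables TLS and verifies the server
      cert_file: /etc/ssl/certs/edgedelta-client.crt    # hypothetical client certificate path
      key_file: /etc/ssl/private/edgedelta-client.key   # hypothetical client key path
      # skip_verify is left unset so the server certificate is still verified
  schema_mappings:
    # ... mappings ...
```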

Example: Kerberos with TLS

This example shows a complete Kudu destination configuration with Kerberos authentication and TLS encryption. This is the typical configuration for production Kudu clusters:

nodes:
- name: secure_kudu
  type: apache_kudu_output
  hosts:
    - kudu-master1.example.com:7051
    - kudu-master2.example.com:7051
  table_name: secure_table
  mode: upsert
  kudu_security:
    auth_type: kerberos
    kerberos:
      principal: edgedelta-agent@EXAMPLE.COM
      keytab: /etc/security/keytabs/edgedelta.keytab
      realm: EXAMPLE.COM
      sasl_protocol_name: kudu
      krb5_conf_path: /etc/krb5.conf
    tls:
      ca_file: /etc/ssl/certs/kudu-ca.crt
  kudu_batch_config:
    rows_limit: 100
    flush_interval: 5s
    flush_mode: auto
  kudu_connection:
    timeout: 30s
    retry_attempts: 3
    retry_delay: 1s
    max_connections: 10
  schema_mappings:
    - column_name: timestamp
      column_type: int64
      expression: timestamp
      is_key: true
      required: true
    - column_name: message
      column_type: string
      expression: body
    - column_name: id
      column_type: string
      expression: attributes["id"]
      is_key: true
      required: true

Note: When Kerberos is enabled on the Kudu cluster (via --rpc_authentication=required on master and tablet servers), the agent must provide valid Kerberos credentials. Connections without proper authentication will be rejected.

Performance Considerations

When configuring the Apache Kudu destination, consider the following for optimal performance:

  1. Batch Size: Adjust rows_limit in kudu_batch_config based on your data volume and latency requirements. Larger batches improve throughput but increase latency.
  2. Write Mode: Use upsert mode when you need to handle duplicate keys, but be aware it has slightly higher overhead than insert mode.
  3. Connection Pool: Set max_connections based on your Kudu cluster capacity and expected throughput.
  4. Schema Design: Define key columns (is_key: true) carefully as they determine the primary key and affect write performance.
  5. Flush Interval: Balance between data freshness and write efficiency with the flush_interval setting.
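
As a rough illustration of how these settings combine, the following sketch leans toward throughput over latency by using larger batches, a longer flush interval, and more workers and connections. The values are illustrative starting points drawn from the examples on this page, not measured recommendations; tune them against your own cluster.

```yaml
- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: upsert
  parallel_worker_count: 10   # default is 5
  kudu_batch_config:
    rows_limit: 1000          # larger batches: higher throughput, higher latency
    flush_interval: 10s       # longer interval: fewer writes, less fresh data
    flush_mode: auto
  kudu_connection:
    max_connections: 20       # size to what the Kudu cluster can handle
  schema_mappings:
    # ... mappings ...
```

For latency-sensitive pipelines, the same fields would move in the opposite direction: a smaller rows_limit and a shorter flush_interval.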

Troubleshooting

For comprehensive troubleshooting of Apache Kudu destination issues including connection problems, schema mismatches, performance optimization, and debugging techniques, see the Apache Kudu Troubleshooting Guide.