Edge Delta Apache Kudu Destination

Configure the Apache Kudu destination node to send data to Apache Kudu tables with schema mappings and batch configuration.

Overview

The Apache Kudu destination node sends data to Apache Kudu tables. Apache Kudu is a distributed columnar storage engine optimized for fast analytics on fast data. This node supports flexible schema mappings, batch configuration, and both insert and upsert write modes.

Note: This node is currently in beta and is available for Enterprise tier accounts.

This node requires Edge Delta agent version v2.7.0 or higher.

Important: The Apache Kudu destination requires an Extended Pipeline Type (Edge-extended or Cloud-extended). When creating your pipeline in the Edge Delta UI, ensure you select the extended pipeline type option to deploy the specialized agent binary that includes Kudu support. Standard agent binaries do not include the Apache Kudu client libraries required for this destination.

Example Configuration

This configuration writes user data to a Kudu table named user_table. It uses upsert mode to update existing records or insert new ones based on the user_id key column. The schema maps incoming data attributes to table columns with appropriate data types, where user_id and created_at are required fields.

nodes:
- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: user_table
  mode: upsert
  schema_mappings:
  - column_name: user_id
    column_type: string
    expression: attributes["user_id"]
    is_key: true
    required: true
  - column_name: age
    column_type: int32
    expression: attributes["age"]
    required: false
  - column_name: created_at
    column_type: int64
    expression: attributes["created_at"]
    required: true
  - column_name: email
    column_type: string
    expression: attributes["email"]
    required: false
  - column_name: is_active
    column_type: bool
    expression: attributes["is_active"]
    default_value: "true"

Required Parameters

name

A descriptive name for the node. This name appears in the pipeline builder, and it is used to reference the node elsewhere in the YAML. It must be unique across all nodes. Because the node is a YAML list element, it begins with a - and a space followed by the string. This parameter is required for all nodes.

nodes:
  - name: <node name>
    type: <node type>

type: apache_kudu_output

The type parameter specifies the type of node being configured. It is specified as a string from a closed list of node types. It is a required parameter.

nodes:
  - name: <node name>
    type: <node type>

hosts

The hosts parameter specifies the list of Apache Kudu master server addresses. It is specified as an array of strings in the format host:port and is required.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - master1.example.com:7051
    - master2.example.com:7051
    - master3.example.com:7051
  table_name: <target table>

table_name

The table_name parameter defines the name of the Kudu table to write data to. It is specified as a string and is required.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table

schema_mappings

The schema_mappings parameter defines the list of column mappings for the Kudu table schema. Each mapping specifies how to extract data from incoming items and map them to Kudu table columns. This parameter is required.

Each schema mapping contains the following fields:

  • column_name (string, required): Name of the column in the Kudu table.
  • column_type (string, required): Data type of the column. Options: string, int32, int64, bool, float, double, binary.
  • expression (string, optional): Expression used to extract the value from the incoming data.
  • is_key (boolean, optional): Whether this column is a key column.
  • required (boolean, optional): Whether this column is required (non-null). Default: false.
  • default_value (string, optional): Default value used for the column if no value is provided.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  schema_mappings:
  - column_name: id
    column_type: string
    expression: attributes["id"]
    is_key: true
    required: true
  - column_name: timestamp
    column_type: int64
    expression: attributes["timestamp"]
    required: true
  - column_name: message
    column_type: string
    expression: body
  - column_name: severity
    column_type: string
    expression: attributes["severity"]
    default_value: "INFO"

Optional Parameters

mode

The mode parameter specifies the write mode for data insertion. It accepts two values:

  • upsert: Updates existing rows or inserts new ones (default)
  • insert: Only inserts new rows

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: upsert
  schema_mappings:
    # ... mappings ...

batch_config

The batch_config parameter configures batching behavior for writing data to Kudu. It helps optimize performance by grouping multiple write operations.

  • rows_limit (integer, default: 100): Maximum number of rows per batch. Example: 1000.
  • row_size_limit (string): Maximum size limit per row. Example: "1MB", "512KB".
  • flush_interval (string): Time interval at which batched data is flushed. Example: "5s", "1m".
  • flush_mode (string, default: "auto"): Mode for flushing batched data. Options: "auto", "manual".

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  batch_config:
    rows_limit: 500
    row_size_limit: "2MB"
    flush_interval: "10s"
    flush_mode: auto
  schema_mappings:
    # ... mappings ...

connection

The connection parameter configures connection management settings.

  • timeout (string): Connection timeout. Example: "30s", "1m".
  • retry_attempts (integer, default: 3): Number of retry attempts for failed operations. Example: 5.
  • retry_delay (string): Delay between retry attempts. Example: "1s", "500ms".
  • max_connections (integer, default: 10): Maximum number of concurrent connections. Example: 20.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  connection:
    timeout: "45s"
    retry_attempts: 5
    retry_delay: "2s"
    max_connections: 15
  schema_mappings:
    # ... mappings ...

parallel_worker_count

The parallel_worker_count parameter specifies the number of workers that run in parallel to process and send data to Kudu. Increasing this value can improve throughput for high-volume data streams.

Default: 5

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  parallel_worker_count: 10
  schema_mappings:
    # ... mappings ...

tls

Configure TLS settings for secure connections to this destination. TLS is optional and typically used when connecting to endpoints that require encrypted transport (HTTPS) or mutual TLS.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      <tls options>

Enable TLS

Enables TLS encryption for outbound connections to the destination endpoint. When enabled, all communication with the destination will be encrypted using TLS/SSL. This should be enabled when connecting to HTTPS endpoints or any service that requires encrypted transport. (YAML parameter: enabled)

Default: false

When to use: Enable when the destination requires HTTPS or secure connections. Always enable for production systems handling sensitive data, connections over untrusted networks, or when compliance requirements mandate encryption in transit.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      enabled: true

Ignore Certificate Check

Disables TLS certificate verification, allowing connections to servers with self-signed, expired, or invalid certificates. This bypasses security checks that verify the server’s identity and certificate validity. (YAML parameter: ignore_certificate_check)

Default: false

When to use: Only use in development or testing environments with self-signed certificates. NEVER enable in production—this makes your connection vulnerable to man-in-the-middle attacks. For production with self-signed certificates, use ca_file or ca_path to explicitly trust specific certificates instead.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ignore_certificate_check: true  # Only for testing!

CA Certificate File

Specifies the absolute path to a CA (Certificate Authority) certificate file used to verify the destination server’s certificate. This allows you to trust specific CAs beyond the system’s default trusted CAs, which is essential when connecting to servers using self-signed certificates or private CAs. (YAML parameter: ca_file)

When to use: Required when connecting to servers with certificates signed by a private/internal CA, or when you want to restrict trust to specific CAs only. Choose either ca_file (single CA certificate) or ca_path (directory of CA certificates), not both.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_file: /certs/ca.pem

CA Certificate Path

Specifies a directory path containing one or more CA certificate files for verifying the destination server’s certificate. Use this when you need to trust multiple CAs or when managing CA certificates across multiple files. All certificate files in the directory will be loaded. (YAML parameter: ca_path)

When to use: Alternative to ca_file when you have multiple CA certificates to trust. Useful for environments with multiple private CAs or when you need to rotate CA certificates without modifying configuration. Choose either ca_file or ca_path, not both.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_path: /certs/ca-certificates/

Certificate File

Path to the client certificate file (public key) used for mutual TLS (mTLS) authentication with the destination server. This certificate identifies the client to the server and must match the private key. The certificate should be in PEM format. (YAML parameter: crt_file)

When to use: Required only when the destination server requires mutual TLS authentication, where both client and server present certificates. Must be used together with key_file. Not needed for standard client TLS connections where only the server presents a certificate.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
      key_file: /certs/client-key.pem

Private Key File

Path to the private key file corresponding to the client certificate. This key must match the public key in the certificate file and is used during the TLS handshake to prove ownership of the certificate. Keep this file secure with restricted permissions. (YAML parameter: key_file)

When to use: Required for mutual TLS authentication. Must be used together with crt_file. If the key file is encrypted with a password, also specify key_password. Only needed when the destination server requires client certificate authentication.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
      key_file: /certs/client-key.pem
      key_password: <password>  # Only if key is encrypted

Private Key Password

Password (passphrase) used to decrypt an encrypted private key file. Only needed if your private key file is password-protected. If your key file is unencrypted, omit this parameter. (YAML parameter: key_password)

When to use: Optional. Only required if key_file is encrypted/password-protected. For enhanced security, use encrypted keys in production environments. If you receive “bad decrypt” or “incorrect password” errors, verify the password matches the key file encryption.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem
      key_file: /certs/encrypted-client-key.pem
      key_password: mySecurePassword123

Minimum TLS Version

Minimum TLS protocol version to use when connecting to the destination server. This enforces a baseline security level by refusing to connect if the server doesn’t support this version or higher. (YAML parameter: min_version)

Available versions:

  • TLSv1_0 - Deprecated, not recommended (security vulnerabilities)
  • TLSv1_1 - Deprecated, not recommended (security vulnerabilities)
  • TLSv1_2 - Recommended minimum for production (default)
  • TLSv1_3 - Most secure, use when destination supports it

Default: TLSv1_2

When to use: Set to TLSv1_2 or higher for production deployments. Only use TLSv1_0 or TLSv1_1 if connecting to legacy servers that don’t support newer versions, and be aware of the security risks. TLS 1.0 and 1.1 are officially deprecated.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      min_version: TLSv1_2

Maximum TLS Version

Maximum TLS protocol version to use when connecting to the destination server. This is typically used to restrict newer TLS versions if compatibility issues arise with specific server implementations. (YAML parameter: max_version)

Available versions:

  • TLSv1_0
  • TLSv1_1
  • TLSv1_2
  • TLSv1_3

When to use: Usually left unset to allow the most secure version available. Only set this if you encounter specific compatibility issues with TLS 1.3 on the destination server, or for testing purposes. In most cases, you should allow the latest TLS version.

YAML Configuration Example:

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      max_version: TLSv1_3
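
The TLS options above can be combined in a single tls block. As a sketch, a mutual TLS configuration with a private CA might look like the following; the certificate paths and host are hypothetical:

nodes:
  - name: my_apache_kudu
    type: apache_kudu_output
    hosts:
      - master1.example.com:7051
    table_name: my_table
    tls:
      enabled: true
      ca_file: /certs/ca.pem            # CA that signed the Kudu server certificates
      crt_file: /certs/client-cert.pem  # client certificate for mutual TLS
      key_file: /certs/client-key.pem   # private key matching the client certificate
      min_version: TLSv1_2
    schema_mappings:
      # ... mappings ...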

Performance Considerations

When configuring the Apache Kudu destination, consider the following for optimal performance (a combined tuning example follows the list):

  1. Batch Size: Adjust rows_limit in batch_config based on your data volume and latency requirements. Larger batches improve throughput but increase latency.
  2. Write Mode: Use upsert mode when you need to handle duplicate keys, but be aware it has slightly higher overhead than insert mode.
  3. Connection Pool: Set max_connections based on your Kudu cluster capacity and expected throughput.
  4. Schema Design: Define key columns (is_key: true) carefully as they determine the primary key and affect write performance.
  5. Flush Interval: Balance between data freshness and write efficiency with the flush_interval setting.
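
As a sketch, the batching, connection, and worker settings discussed above might be tuned together for a higher-volume stream as follows; the specific values are illustrative starting points, not recommendations:

- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - master1.example.com:7051
    - master2.example.com:7051
  table_name: my_table
  mode: insert                 # lower overhead when duplicate keys are not expected
  parallel_worker_count: 10    # more workers for a high-volume stream
  batch_config:
    rows_limit: 2000           # larger batches favor throughput over latency
    flush_interval: "5s"       # upper bound on data freshness
    flush_mode: auto
  connection:
    timeout: "30s"
    retry_attempts: 3
    retry_delay: "1s"
    max_connections: 20        # keep within the capacity of the Kudu cluster
  schema_mappings:
    # ... mappings ...

Start from the defaults and adjust one setting at a time while monitoring write latency and throughput.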

Troubleshooting

For comprehensive troubleshooting of Apache Kudu destination issues including connection problems, schema mismatches, performance optimization, and debugging techniques, see the Apache Kudu Troubleshooting Guide.