Edge Delta Apache Kudu Destination

Configure the Apache Kudu destination node to send data to Apache Kudu tables with schema mappings and batch configuration.

Overview

The Apache Kudu destination node sends data to Apache Kudu tables. Apache Kudu is a distributed columnar storage engine optimized for fast analytics on fast data. This node supports flexible schema mappings, batch configuration, and both insert and upsert write modes.

Note: This node is currently in beta and is available for Enterprise tier accounts.

Example Configuration

This configuration writes user data to a Kudu table named user_table. It uses upsert mode to update existing records or insert new ones based on the user_id key column. The schema maps incoming data attributes to table columns with appropriate data types, where user_id and created_at are required fields.

nodes:
- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: user_table
  mode: upsert
  schema_mappings:
  - column_name: user_id
    column_type: string
    expression: attributes["user_id"]
    is_key: true
    required: true
  - column_name: age
    column_type: int32
    expression: attributes["age"]
    required: false
  - column_name: created_at
    column_type: int64
    expression: attributes["created_at"]
    required: true
  - column_name: email
    column_type: string
    expression: attributes["email"]
    required: false
  - column_name: is_active
    column_type: bool
    expression: attributes["is_active"]
    default_value: "true"

Required Parameters

name

A descriptive name for the node. This name appears in the pipeline builder, and you can use it to reference the node elsewhere in the YAML. It must be unique across all nodes. Because each node is a YAML list element, the entry begins with a hyphen and a space, followed by the name string. This parameter is required for all nodes.

nodes:
  - name: <node name>
    type: <node type>

type: apache_kudu_output

The type parameter specifies the type of node being configured. It is a string chosen from a closed list of node types; for this node, the value is apache_kudu_output. This parameter is required.

nodes:
  - name: <node name>
    type: <node type>

hosts

The hosts parameter specifies the list of Apache Kudu master server addresses. It is specified as an array of strings in the format host:port and is required.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - master1.example.com:7051
    - master2.example.com:7051
    - master3.example.com:7051
  table_name: <target table>

table_name

The table_name parameter defines the name of the Kudu table to write data to. It is specified as a string and is required.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table

schema_mappings

The schema_mappings parameter defines the list of column mappings for the Kudu table schema. Each mapping specifies how to extract a value from an incoming item and map it to a Kudu table column. This parameter is required.

Each schema mapping contains the following fields:

| Field         | Required | Description                                          | Type    | Options                                           |
|---------------|----------|------------------------------------------------------|---------|---------------------------------------------------|
| column_name   | Yes      | Name of the column in the Kudu table                 | string  | -                                                 |
| column_type   | Yes      | Data type of the column                              | string  | string, int32, int64, bool, float, double, binary |
| expression    | No       | Expression to extract value from the data            | string  | -                                                 |
| is_key        | No       | Whether this column is a key column                  | boolean | true, false                                       |
| required      | No       | Whether this column is required (non-null)           | boolean | true, false (default: false)                      |
| default_value | No       | Default value for the column if no value is provided | string  | -                                                 |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  schema_mappings:
  - column_name: id
    column_type: string
    expression: attributes["id"]
    is_key: true
    required: true
  - column_name: timestamp
    column_type: int64
    expression: attributes["timestamp"]
    required: true
  - column_name: message
    column_type: string
    expression: body
  - column_name: severity
    column_type: string
    expression: attributes["severity"]
    default_value: "INFO"
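
The examples so far use only string, int32, int64, and bool columns. As a further sketch, the mapping below uses the remaining column_type options listed above (float, double, and binary). The table name and attribute names (sensor_readings, sensor_id, temperature, pressure, raw_payload) are hypothetical, and whether a given attribute converts cleanly to the target type depends on your data.

```yaml
nodes:
- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: sensor_readings               # hypothetical table name
  schema_mappings:
  - column_name: sensor_id                  # hypothetical key column
    column_type: string
    expression: attributes["sensor_id"]
    is_key: true
    required: true
  - column_name: temperature
    column_type: float
    expression: attributes["temperature"]   # hypothetical attribute
  - column_name: pressure
    column_type: double
    expression: attributes["pressure"]      # hypothetical attribute
  - column_name: raw_payload
    column_type: binary
    expression: attributes["raw_payload"]   # hypothetical attribute
```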

Optional Parameters

mode

The mode parameter specifies the write mode for data insertion. It accepts two values:

  • upsert: Updates existing rows or inserts new ones (default)
  • insert: Only inserts new rows

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: upsert
  schema_mappings:
    # ... mappings ...
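
For append-only data where each key is expected to be written once, insert mode can be set explicitly instead of relying on the upsert default. A minimal sketch, assuming a hypothetical events table keyed on event_id:

```yaml
nodes:
- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: events                  # hypothetical table name
  mode: insert                        # only inserts new rows; omit to use the default (upsert)
  schema_mappings:
  - column_name: event_id             # hypothetical key column
    column_type: string
    expression: attributes["event_id"]
    is_key: true
    required: true
```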

batch_config

The batch_config parameter configures batching behavior for writing data to Kudu. It helps optimize performance by grouping multiple write operations.

| Field          | Description                         | Type    | Default | Example          |
|----------------|-------------------------------------|---------|---------|------------------|
| rows_limit     | Maximum number of rows per batch    | integer | 100     | 1000             |
| row_size_limit | Maximum size limit per row          | string  | -       | "1MB", "512KB"   |
| flush_interval | Time interval to flush batched data | string  | -       | "5s", "1m"       |
| flush_mode     | Mode for flushing batched data      | string  | "auto"  | "auto", "manual" |

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  batch_config:
    rows_limit: 500
    row_size_limit: "2MB"
    flush_interval: "10s"
    flush_mode: auto
  schema_mappings:
    # ... mappings ...

connection

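The connection parameter configures how the node connects to Kudu: the connection timeout, retry behavior, and the maximum number of concurrent connections.

| Field           | Description                                    | Type    | Default | Example       |
|-----------------|------------------------------------------------|---------|---------|---------------|
| timeout         | Connection timeout                             | string  | -       | "30s", "1m"   |
| retry_attempts  | Number of retry attempts for failed operations | integer | 3       | 5             |
| retry_delay     | Delay between retry attempts                   | string  | -       | "1s", "500ms" |
| max_connections | Maximum number of concurrent connections       | integer | 10      | 20            |

- name: <node name>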
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  connection:
    timeout: "45s"
    retry_attempts: 5
    retry_delay: "2s"
    max_connections: 15
  schema_mappings:
    # ... mappings ...

parallel_worker_count

The parallel_worker_count parameter specifies the number of workers that run in parallel to process and send data to Kudu. Increasing this value can improve throughput for high-volume data streams.

- name: <node name>
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  parallel_worker_count: 10
  schema_mappings:
    # ... mappings ...

Default: 5

tls

The tls parameter is a dictionary that configures TLS settings for secure connections to the destination. It is optional and typically used when connecting to endpoints that require encrypted transport or mutual TLS.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      <tls options>

enabled

Specifies whether TLS is enabled. This is a Boolean value. Default is false.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      enabled: true

ignore_certificate_check

Disables certificate verification. Useful for test environments. Default is false.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ignore_certificate_check: true

ca_file

Specifies the absolute path to a CA certificate file for verifying the remote server’s certificate.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_file: /certs/ca.pem

ca_path

Specifies a directory containing one or more CA certificate files.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      ca_path: /certs/

crt_file

Path to the client certificate file for mutual TLS authentication.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      crt_file: /certs/client-cert.pem

key_file

Path to the private key file used for client TLS authentication.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      key_file: /certs/client-key.pem

key_password

Password for the TLS private key file, if required.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      key_password: <password>

client_auth_type

Controls how client certificates are requested and validated during the TLS handshake. Valid options:

  • noclientcert
  • requestclientcert
  • requireanyclientcert
  • verifyclientcertifgiven
  • requireandverifyclientcert

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      client_auth_type: requireandverifyclientcert

max_version

Maximum supported version of the TLS protocol. Valid options:

  • TLSv1_0
  • TLSv1_1
  • TLSv1_2
  • TLSv1_3

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      max_version: TLSv1_3

min_version

Minimum supported version of the TLS protocol. Default is TLSv1_2.

nodes:
  - name: <node name>
    type: <destination type>
    tls:
      min_version: TLSv1_2
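
The TLS options above are each shown in isolation against a generic destination. Combined on an Apache Kudu node, a mutual TLS setup might look like the following sketch. The host and certificate paths are placeholders, and it is an assumption that your Kudu cluster is configured to accept these certificates.

```yaml
nodes:
  - name: my_apache_kudu
    type: apache_kudu_output
    hosts:
      - master1.example.com:7051
    table_name: my_table
    tls:
      enabled: true
      ca_file: /certs/ca.pem             # CA used to verify the server certificate
      crt_file: /certs/client-cert.pem   # client certificate for mutual TLS
      key_file: /certs/client-key.pem    # private key used with crt_file
      min_version: TLSv1_2               # matches the documented default
    schema_mappings:
      # ... mappings ...
```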

Performance Considerations

When configuring the Apache Kudu destination, consider the following for optimal performance:

  1. Batch Size: Adjust rows_limit in batch_config based on your data volume and latency requirements. Larger batches improve throughput but increase latency.
  2. Write Mode: Use upsert mode when you need to handle duplicate keys, but be aware it has slightly higher overhead than insert mode.
  3. Connection Pool: Set max_connections based on your Kudu cluster capacity and expected throughput.
  4. Schema Design: Define key columns (is_key: true) carefully as they determine the primary key and affect write performance.
  5. Flush Interval: Balance between data freshness and write efficiency with the flush_interval setting.
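
As an illustration of how these settings combine, the sketch below leans toward throughput over freshness: larger batches, a longer flush interval, more workers, and a larger connection limit. The specific values are illustrative assumptions rather than recommendations, and insert mode is only appropriate here if keys are not expected to repeat.

```yaml
nodes:
- name: my_apache_kudu
  type: apache_kudu_output
  hosts:
    - localhost:7051
  table_name: my_table
  mode: insert                 # assumes keys do not repeat; default is upsert
  parallel_worker_count: 10    # default is 5
  batch_config:
    rows_limit: 1000           # default is 100
    flush_interval: "30s"      # longer interval trades freshness for write efficiency
  connection:
    max_connections: 20        # default is 10
  schema_mappings:
    # ... mappings ...
```

For latency-sensitive pipelines, the opposite adjustments apply: a smaller rows_limit and a shorter flush_interval.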

Troubleshooting

For comprehensive troubleshooting of Apache Kudu destination issues including connection problems, schema mismatches, performance optimization, and debugging techniques, see the Apache Kudu Troubleshooting Guide.