Kafka Connectors: More than just configs

Posted by Stefan Kecskes on Thursday, October 23, 2025

As a data engineer, you’ve probably seen a Kafka connector config file and thought:

“Is that it? A simple JSON?”

Yes, it starts as a simple JSON, but it represents a modular and extensible architecture capable of transformations, conditional logic, schema governance, error handling, and high performance. Wow, that was a mouthful, but in this post, I will take you on a journey through the layers of Kafka Connectors architecture to prove that they are powerful building blocks in modern streaming pipelines. I will share my real experiences and learnings from working with Kafka Connectors in various projects.

So, see for yourself: this is a minimal example config that reads orders from Postgres and stores them in Kafka (the // comments are just for illustration; JSON itself does not allow comments, so strip them before POSTing):

{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",

    "connection.url": "jdbc:postgresql://db:5432/orders",
    "connection.user": "app",
    "connection.password": "secret",

    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "orders-source-",

    // --- Single Message Transform ---
    "transforms": "TimestampConverter",
    "transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
    "transforms.TimestampConverter.field": "timestamp",
    "transforms.TimestampConverter.format": "yyyy-MM-dd HH:mm:ss.S",
    "transforms.TimestampConverter.target.type": "string"

    // --- Converters (serialization) ---
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",

    // --- Error handling and DLQ ---
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq.postgres.orders",
    "errors.deadletterqueue.context.headers.enable": "true",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true"
  }
}

But before I start to explain the details, let’s do a quick refresher on Kafka Connect architecture.

Recap: Kafka Connect 101

At its core, Kafka Connect is a framework for moving data between Apache Kafka and external systems. Scalability and fault tolerance are built in from the start, and its architecture is surprisingly elegant, which makes it straightforward to get started with.

Key Components

A quick glossary:

  • Connect Cluster: A group of worker nodes that run the Kafka Connect framework. These are usually containers deployed on Kubernetes, or VMs.
  • Workers: Java processes that host connectors and tasks. They coordinate workloads, manage resources, and ensure fault tolerance. These run in the Connect cluster.
  • Connectors: Defined in JSON, they are high-level job definitions of data pipelines. e.g. “read from Postgres” or “write to Elasticsearch”. Connectors are responsible for creating and managing tasks.
  • Tasks: The actual units of work that perform the data transfer. A connector's job is logically partitioned into tasks for parallel processing.
  • Converters: Handle serialization and deserialization of data between Kafka and the external systems.
  • Offsets & Status Topics: Connectors store offsets (so they know where to resume after failure). They also store status metrics in internal Kafka topics.
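These pieces map directly onto a distributed worker's configuration; a minimal sketch (the topic names and plugin path here are illustrative, not mandated):

```properties
# Kafka cluster the Connect workers talk to
bootstrap.servers=kafka:9092
# Workers with the same group.id form one Connect cluster
group.id=connect-cluster

# Internal topics for connector configs, offsets, and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# Default converters (connectors can override these)
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# Where connector plugins (jars) are discovered
plugin.path=/usr/share/java
```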

Parallelism & Scalability

Connectors declare how many tasks they can support so that the Connect framework can coordinate workers to distribute tasks across the cluster. This horizontal scaling enables parallel throughput. The architecture is designed to handle scaling of tasks, workers, rebalancing and reliability. This is all managed by the Connect framework, and you don’t really have to worry about it. I am mentioning this just for completeness. What you need to know is what happens inside the pipelines/connectors, and that is what we will explore next.

The Data Flow From Source to Sink

At a high level, data flows from a source system (like a database or file system) into Kafka topics, and then from those topics to a sink system (like another database, platform, or data warehouse). The centerpiece is always a Kafka topic, hence the name Kafka Connect.

Connectors

  1. A source connector reads from an external system (e.g., database, API).
  2. It registers the record's schema in the Schema Registry.
  3. The record lands in a Kafka topic.
  4. A sink connector reads from that topic.
  5. It fetches the exact schema version from the Schema Registry.
  6. It writes to a target system (e.g., DB insert/upsert, file, HTTP).
  7. If a record fails processing (and the error tolerance allows it), it is written to a DLQ (dead letter queue) topic for later analysis.

But connectors are not limited to passing data through untouched: they can shape, filter, mutate, enrich, or gate each record before or after Kafka stores it. And we do this not in yet another ETL framework, but inside the connector itself.

Lifecycle of a Record

Let's trace a record end to end as it flows through the layers of a connector task. As an example, we will follow a sink connector that reads from a Kafka topic and writes to a database:

Sink Task internals

  1. Reads bytes: Sink connector reads from a Kafka topic, retrieves bytes and schema metadata (if relevant).
  2. Deserialization: A converter (e.g. AvroConverter, JsonConverter, ProtobufConverter) turns the bytes into an internal record form, using the value from Kafka and, if configured, the schema from the Schema Registry.
  3. Applies all transformations configured on the sink connector (SMTs).
  4. Write to a target system. Usually each connector transforms kafka record to the target system’s format (e.g., SQL insert statement, HTTP request).
  5. Commit offsets in a Kafka topic, so tasks can resume reliably.
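The steps above can be sketched as a toy pipeline. This is plain Python, not Connect's actual API; the converter, SMT, and target-format writer here are hypothetical stand-ins:

```python
import json

def deserialize(raw_bytes):
    # Stand-in for a Converter (step 2): bytes -> structured record
    return json.loads(raw_bytes.decode("utf-8"))

def mask_field(record, field):
    # Stand-in for a MaskField-style SMT (step 3)
    masked = dict(record)
    masked[field] = "****"
    return masked

def to_sql_insert(table, record):
    # Stand-in for the sink's conversion to the target format (step 4)
    cols = ", ".join(record)
    vals = ", ".join(repr(v) for v in record.values())
    return f"INSERT INTO {table} ({cols}) VALUES ({vals})"

raw = b'{"id": 42, "email": "jane@example.com"}'  # step 1: bytes from the topic
record = deserialize(raw)                          # step 2: deserialization
record = mask_field(record, "email")               # step 3: SMT chain
sql = to_sql_insert("orders", record)              # step 4: target format
# step 5 (offset commit) is handled by the Connect framework itself
```

In the real framework each of these stages is pluggable, which is exactly why a connector is more than its config file.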

This is just an example of a connector, because each connector can have its own specific logic and optimizations. But the key takeaway is that connectors are not just simple data movers; they can actively shape and manage the data.

Schema Support, Converters & Schema Registry

Kafka itself is schema-agnostic: it doesn't care what data you put in it. Kafka Connect builds on top of it and fills this gap with schema support. Suddenly we don't have just raw bytes, but a structured representation of our data. Of course, this is optional; you can go schema-aware or schema-less.

Choosing the right schema strategy

  • Schema-aware Converters support formats like Avro, Protobuf, or JSON Schema, and these converters integrate with a Schema Registry (like Confluent Schema Registry). If you go this path, you can do schema validating and enforcing compatibility rules for schema evolution (BACKWARD, FORWARD, FULL).
  • Schema-less Converters treat data as raw bytes or plain strings without schema enforcement (schemas.enable=false). I can guarantee you that this is not what you want in production.
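Concretely, the schema-less path is just a converter setting on the connector (the example at the top of the post already shows the schema-aware Avro pair):

```json
{
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false"
}
```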

In real projects, you will have changes like:

  • Adding fields
  • Renaming fields
  • Changing data types (narrowing/widening)
  • Deprecating fields
  • Nested structures changes

Without schema governance, these shifts will break downstream consumers or lead to data quality issues. You want to catch these issues early by shifting most work to the left with schema validation. This schema evolution goes hand in hand with transformations, which we will explore next.
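As a sketch of what governance catches: under BACKWARD compatibility, a consumer on the new schema must still read old records, which is why an added field needs a default. A hypothetical Avro order schema, v2:

```json
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "EUR"}
  ]
}
```

Registering this against a v1 that lacked currency passes a BACKWARD compatibility check; the same field without the default would be rejected by the registry.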

Single Message Transforms (SMTs)

SMTs (Single Message Transforms) are lightweight Java classes installed on the Connect cluster that extend what connectors can do: they adjust individual records as they flow through the connector. SMTs are not as powerful as full-blown stream processing frameworks (like Kafka Streams or Flink), but they are very useful for simple tasks like:

  • Cast: Change field’s type (e.g., int to string).
  • Drop: Remove value from a record.
  • InsertField: Add timestamp or metadata fields.
  • Rename/ReplaceField: Change field names.
  • MaskField: Obfuscate sensitive data - PII compliance.
  • HoistField: Wrap the entire record in a new field.
  • Flatten/Unwrap: Change nested structures to flat or vice versa.
  • ValueToKey: Promote a field from the value to the key (useful for partitioning).

This is just a small subset of the available transforms, and you can also write your own custom SMTs if needed or install third-party ones. Each sink and source connector can use these SMTs, so they are reusable building blocks (like lego). They can be chained together to form a pipeline of transformations (transforms=addThis,removeThat,dropKey) and they run inline within the connector task, so you don’t need a separate streaming job.
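A sketch of such a chain, using built-in SMT classes (the field names are hypothetical); transforms run left to right:

```json
{
  "transforms": "castId,maskPii,makeKey",

  "transforms.castId.type": "org.apache.kafka.connect.transforms.Cast$Value",
  "transforms.castId.spec": "id:string",

  "transforms.maskPii.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.maskPii.fields": "email",

  "transforms.makeKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "transforms.makeKey.fields": "id"
}
```

Order matters here: makeKey sees the record after the cast and the masking have already run.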

Predicates & Conditional Transforms

One of the more underappreciated features is predicates. Predicates let you apply transforms conditionally, only when certain conditions are met. For example, use the built-in RecordIsTombstone to filter out delete records, or a FieldEquals-style predicate (Apache Kafka itself ships only TopicNameMatches, HasHeaderKey, and RecordIsTombstone; a field-matching predicate comes from a third-party library or your own code) to apply a transform only when a field has a specific value. You could easily filter a huge dataset of job applications down to only those candidates who have the right attitude, like I am doing here:

{
  "transforms": "sendOffer",
 
  "transforms.sendOffer.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.sendOffer.predicate": "isFunny",
  "transforms.sendOffer.static.field": "send_job_offer",
  "transforms.sendOffer.static.value": "true",
 
  "predicates": "isFunny",
  "predicates.isFunny.type": "org.apache.kafka.connect.transforms.predicates.FieldEquals$Value",
  "predicates.isFunny.field": "sense_of_humor",
  "predicates.isFunny.value": "true"
}

This is a simple example. The built-in combinators are limited: there is no AND/OR, but you can negate a predicate on a given transform (transforms.sendOffer.negate=true) and attach different predicates to different transforms in a chain to approximate more complex conditions. You see? You can write conditional logic without writing a single line of Java code. If you filter out bad apples early (shift left), you save a lot of issues downstream and create a single source of truth. Don't let analytics teams work with dirty data, because they will implement multiple filters for the same columns, and you will end up with data silos, where each team has their own version of the truth.

Of course, I am just joking here (meaning I am funny 🤔)

But jokes aside, we have still not written anything to the target system.

Write Modes: inserts, updates & deletes

On the sink side, connectors often need to support different write modes. The most common modes are:

  • InsertOnly: naive writes, appends new records without checking for existing ones.
  • Upsert: updates an existing record based on a key, or inserts if the record doesn't exist.
  • Delete: handles tombstone records, removing records based on a key.
  • Bulk: batches writes for efficiency, often used for databases or data warehouses (usually tunable via a connector-specific batch size setting).

Note: Tombstone records are special records with a null value instead of a record that indicates a deletion. They are often used in CDC (change data capture) scenarios to represent deleted rows in the source system.
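For instance, upsert and delete handling in the Confluent JDBC sink come down to a few settings; a sketch, assuming an orders topic keyed by id (names and URLs are illustrative):

```json
{
  "name": "orders-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders-source-orders",
    "connection.url": "jdbc:postgresql://dwh:5432/analytics",

    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "delete.enabled": "true"
  }
}
```

Note that delete.enabled requires pk.mode=record_key, since a tombstone carries only the key.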

Exactly once semantics

Newer connector versions support stronger delivery guarantees. KIP-618 brought exactly-once semantics to source connectors (using Kafka's idempotent, transactional producers), and sink connectors typically achieve effectively-once delivery through idempotent writes such as upsert mode (as in the JDBC sink). This preserves data integrity in the face of failures and retries (so no duplicates). The point is that you don't need a separate Airflow job to handle these cases; the connector can handle it for you. Either way, you will surely want to know more about error handling.

Error Handling & Dead Letter Queues (DLQs)

In many ETL projects, you will unfortunately see broken messages, transient network issues, or schema mismatches. Sometimes you will want to stop all processing and fix the issue before continuing with data. And in some cases you want to continue processing data and put bad eggs into a DLQ topic for later analysis. This is also a good way to set up your monitoring, so if some DLQ suddenly starts to receive more messages, you can trigger an alert to bring that to attention. Of course, it is absolutely fine to not use DLQ if you don’t care about bad data (but don’t be that guy!).

Kafka Connectors provide built-in error handling strategies to deal with these issues:

  • Fail Fast: Stops the connector on the first error (errors.tolerance=none, the default).
  • Log and Continue: Logs the error and skips the problematic record (errors.tolerance=all with errors.log.enable=true).
  • Dead Letter Queue (DLQ): Redirects problematic records to a separate Kafka topic for later analysis (errors.tolerance=all with errors.deadletterqueue.topic.name=dlq-topic).

Actually, to know that something is wrong, you would need monitoring in place. Luckily, Kafka Connect exposes various metrics (error rates, DLQ counts, throughput) that you can monitor with Prometheus/Grafana or any other monitoring tool.

Note: DLQs (Dead Letter Queues) are special Kafka topics where you can send records that failed processing. This allows you to investigate and reprocess them later without losing data. A nice thing about DLQs is that errors can be added to the record header, so you can see what went wrong and why.

Connect does have built-in retries for transient (retriable) errors, tunable via errors.retry.timeout and errors.retry.delay.max.ms, but there is no built-in reprocessing of records that landed in a DLQ, so you will need to implement that yourself if needed. A quick solution is to create another connector that reads from the DLQ topic and pushes records back to the original topic after the issues are fixed. Or, you could write an extension that does retries with backoff. But wait, what are extensions?
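For transient errors, Connect's errors.* family does include retry settings; a sketch of how they combine with the DLQ options (the topic name is illustrative):

```json
{
  "errors.tolerance": "all",
  "errors.retry.timeout": "300000",
  "errors.retry.delay.max.ms": "60000",
  "errors.deadletterqueue.topic.name": "dlq.postgres.orders",
  "errors.deadletterqueue.context.headers.enable": "true"
}
```

With this, a retriable failure is retried with backoff for up to five minutes before the record is routed to the DLQ.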

Extensions and Libraries

Basic SMTs are built-in, and there are community and vendor-provided transformation libraries that extend what you can do. You can also build custom SMTs using Java and deploy a jar file to your Connect cluster. This is what I usually did in the past, when I needed custom logic that wasn’t covered by existing transforms.

In addition to SMTs, you can also write custom connectors if you need to connect to a system that isn't yet supported. Some connectors offer pluggable strategies for specific use cases; for example, the MongoDB connector supports different write strategies (insert, upsert, replace) and various modes for handling schema changes, and you can extend them with custom strategy logic.

Kafka Connectors are very extensible, and currently the ecosystem is very rich. You can find connectors for almost any system, from databases (Postgres, MySQL, MongoDB, BigQuery) to cloud services (S3, GCS, Azure Blob) to search engines (Elasticsearch, Solr), to messaging systems (JMS, MQTT) and many more.

KIPs & Recent Improvements

Kafka Connect was quite basic when it was introduced in 2015. I remember first using it at Booking.com around 2016: there were just simple source and sink connectors with basic functionality. But over the years, Kafka's active community has added many features and improvements. Some interesting KIPs (Kafka Improvement Proposals) include:

  • KIP-196: Add metrics to Kafka Connect framework
  • KIP-437: Custom replacement for MaskField SMT = More flexible PII masking.
  • KIP-582: Add a “continue” option for Kafka Connect error.tolerance
  • KIP-585: Filter and Conditional SMTs
  • KIP-618: Exactly-Once Support for Source Connectors
  • KIP-769: Connect APIs to list all connector plugins and retrieve their configuration definitions
  • KIP-793: Allow sink connectors to be used with topic-mutating SMTs
  • KIP-802: Validation Support for Kafka Connect SMT and Converter Options
  • KIP-808: Add support for different unix precisions in TimestampConverter SMT
  • KIP-821: Connect Transforms support for nested structures
  • KIP-850: REST API for filtering Connector plugins by type
  • KIP-875: First-class offsets support in Kafka Connect
  • KIP-910: Update Source offsets for Source Connectors without producing records

Also, there are companies like Confluent or Aiven that provide enterprise-grade connectors with additional features, support, and optimizations. Either get a Data Engineer into your team, so that you can implement specific extensions for your use case, or you might want to consider those paid options.

Stateless vs Stateful

Connectors are not designed for complex stateful processing. They are best suited for stateless transformations, so let’s cover some limits:

  • There is no state shared between records. Each record is processed independently. No cross-record transformations.
  • SMTs are per-record and stateless, so no windowing, aggregations, or joins. For complex analytics you still need a stream processing framework (like Kafka Streams, Flink, or Spark).
  • Very heavy transformations can impact performance, so keep them lightweight. Don't do external DB joins or ML inference in SMTs.
  • Connectors work with records. Don't try to reshape entire data flows in a hacky way.
  • If your connector config is longer than ~25 lines, you are probably making it too complex; consider moving logic into Kafka Streams or Flink.
  • The connector itself is stateless, but Connect’s runtime is stateful. It manages offsets, partitions, status topics, and internal state. Remind yourself about this when you need to coordinate multiple tasks/workers.

Performance, Scaling & Stability

What I would like to highlight from my personal experience is that Kafka Connectors are very stable, even under heavy load. I have seen entire networks fail and come back up, and the connectors just resumed from where they left off; the built-in offset management and fault tolerance work really well. I usually use the REST API to monitor and manage connectors (Postman collections are lifesavers): GET /connectors/{name}/status to check state, POST /connectors/{name}/tasks/{id}/restart to restart failed tasks, and PUT /connectors/{name}/config to update configurations. But in real life, once you deploy a connector, that is often the last time you hear about it. It will work for months without any issues.

Best Practices

Some things I learned the hard way and would like to share as best practices:

  • treat connector configs as code and store them in version control (Git).
  • use infrastructure as code (e.g. Terraform) to deploy connectors.
  • monitor connector metrics (DLQ spikes, error rates, throughput, latency).
  • high CPU usually means complex transformations or high throughput; scale out by adding more workers or tasks.
  • test schema changes in dev or staging environment before production!
  • use DLQs to catch bad data. It will help you analyze errors later.
  • naming is hard. Please give your transformations/predicates meaningful names so that you can understand them later.

Conclusion

A Kafka connector JSON config seems simple, with its ten-odd lines, but it represents a powerful, stable, and extensible pipeline. Now you know that connectors support transformations, conditional logic, schema governance, error handling, write modes, and anything else you decide to extend them with. Often, a single connector can replace an entire complex ETL job with separate microservices.

I love chatting about Kafka Connect, data pipelines, and real-world connector setups. If you’ve built something interesting or fought with a tricky SMT, share your experience or tag me on LinkedIn or X (Twitter). I always enjoy comparing notes and hearing how others use Kafka Connect in production.

When you realize a simple connector replaces your entire microservice architecture