Even if you’re only moderately familiar with data integration or technologies like real-time analytics, chances are high that you’ve stumbled upon Apache Kafka. Besides a great name, though, what makes Kafka so compelling that close to 80% of all Fortune 100 companies trust and use it?
The goal of this post is to provide a first glance at what Apache Kafka is able to do and what problems it solves. Furthermore, we’ll see what the Connect framework is and how developers can benefit from using it. We’ll also present a few popular Kafka connectors, and, finally, we’ll take a closer look at the Kafka Connect Scalyr connector.
What Is Kafka, and What Does It Do?
Kafka was originally developed by LinkedIn in 2009 and later open sourced through The Apache Software Foundation in 2011. In a nutshell, Kafka is a high-volume, high-throughput, highly scalable, and reliable distributed streaming platform.
To better understand that condensed definition, let’s introduce a real-world problem that’s more than common.
An organization, for example, has several departments, each of which could have several applications. These applications generate massive amounts of data, stored in various forms: SQL and NoSQL databases, message queues, and log files, to name a few. Different applications within the same organization need to talk to one another and make use of the data; therefore, there’s a lot of data movement.
The organization’s architecture is already complex, and we can only expect things to get more complicated as the data grows over time, increasing the need for storage and backup strategies.
Apache Kafka is a great solution to weather the storm.
How Kafka Can Help
Kafka, as a distributed streaming platform, runs as a cluster that can span multiple servers; the servers that run Kafka are called brokers. As a streaming platform, Kafka can process data as it’s generated. It stores streams of data from many sources, such as sensors, web logs, etc., for a configurable period of time. These sources are known as producers. Kafka then publishes the stored data to anyone who requests it; the requesters are known as consumers.
The streams of data that Kafka stores are subdivided into partitions and replicated. This allows a high volume of users—or consumers, as we mentioned above—to request them simultaneously without a visible lag in performance. In the meantime, because Kafka retains the data, even if a system fails or an interruption of any kind occurs, consumers can continue reading from exactly where they left off. For this reason, Kafka is what we call fault tolerant.
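The idea behind that fault tolerance can be illustrated with a toy sketch. The code below is not the real Kafka API; it’s a minimal, stdlib-only simulation of an append-only partition log where the consumer, not the broker, tracks its read position (offset) and can resume from it after a failure.

```python
class PartitionLog:
    """A toy append-only log: records are kept even after being read."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read_from(self, offset):
        return self.records[offset:]


log = PartitionLog()
for event in ["login", "click", "purchase"]:
    log.append(event)

# A consumer reads two records, commits its position, then "crashes".
committed_offset = 0
batch = log.read_from(committed_offset)[:2]
committed_offset += len(batch)

# After restarting, it resumes exactly where it left off.
resumed = log.read_from(committed_offset)
print(resumed)  # ['purchase']
```

Because reading never removes records from the log, many independent consumers can each keep their own offset into the same stream.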
You can find a detailed Apache Kafka tutorial here.
Kafka Connectors and How They Work
Kafka provides the opportunity to create connectors through the Connect framework. Let’s see a bit more about what connectors are.
The producer and consumer applications that we saw earlier provide the ability to get data in and out of Kafka. It’s a common necessity to bring data stored in other systems into Kafka and retrieve data from Kafka back to those systems.
One of the major advantages of Kafka is that data from external systems can easily get into Kafka, and with the same ease, it can work the other way around. This power comes with the Connect framework. The Connect framework provides an ecosystem of pluggable and reusable producers and consumers. It includes two types of connectors: source and sink.
In a Kafka cluster, it’s more than likely that developers at some point might need to integrate the same type of data coming from different sources. MongoDB, HDFS, and JDBC are only a few of the data sources available. Although there are definitely many data sources, they aren’t infinite, and there’s a great chance that someone else has already developed a necessary integration.
With the Connect framework, a connector needs to be developed only once. For example, say the development team behind an application creates a connector for it. From there, developers who want to integrate with that application don’t have to start from zero. They just have to configure the connector to bring that data source into the cluster.
Connect also simplifies the process of pulling data out of Kafka. As mentioned above, data sources aren’t infinite, and the same is true of data stores. Most likely, developers want to store data in the same common destinations, like Elasticsearch, Amazon S3, HDFS, or Scalyr. As I mentioned before, it’s possible that someone has already written a connector that makes the integration a breeze.
Now that we’ve covered what the Kafka Connect framework does, let’s dive a little bit deeper into its architecture. Sources of data communicate with the Kafka cluster through the Connect cluster. The Connect cluster consists of workers that pull data from sources by taking advantage of the prewritten connectors that are available. From there, the Connect cluster pushes the data to the Kafka cluster.
The next logical step is to request data from the Kafka cluster in order to use it with applications like Scalyr or Elasticsearch. To do so, you configure the Connect cluster and the appropriate sink connector to pull data from Kafka and write it to the target application.
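In practice, that configuration is usually a small JSON payload submitted to the Connect cluster’s REST API (commonly served on port 8083). Here’s a hedged sketch of what such a payload might look like; the connector class, topic name, and destination property are illustrative placeholders, not taken from any specific connector.

```python
import json

# Hypothetical sink connector configuration; property names after
# "connector.class", "topics", and "tasks.max" vary by connector.
payload = {
    "name": "example-sink",
    "config": {
        "connector.class": "com.example.ExampleSinkConnector",  # placeholder class
        "topics": "app-logs",    # Kafka topics to pull data from
        "tasks.max": "2",        # how many parallel tasks the connector may run
        "destination.url": "https://example.com/ingest",  # placeholder target
    },
}

body = json.dumps(payload)
# You would then submit it to the Connect REST API, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data @connector.json http://localhost:8083/connectors
print(body)
```

The Connect workers take it from there: they read the named topics and deliver the records to the configured destination.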
Kafka Connector Options
A plethora of connectors are available. Open source or commercial, there’s a great chance that a connector for the application you want to integrate with exists. Let’s take a look at some of those options.
JDBC Connector (Source and Sink)
The JDBC source and sink connectors allow you to import and export data from a variety of relational databases into Kafka.
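As a sketch, a JDBC source connector configuration might look like the following. This is modeled on the widely used Confluent JDBC source connector; the property names and connector class come from its documentation and may differ between versions, and the database URL and table are placeholders.

```python
# Hypothetical JDBC source configuration (database details are placeholders).
jdbc_source = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
        "mode": "incrementing",           # fetch only rows with a new, larger ID
        "incrementing.column.name": "id",
        "topic.prefix": "jdbc-",          # tables map to topics like "jdbc-orders"
    },
}

# Each table the connector reads is streamed into its own Kafka topic,
# named by prepending the prefix to the table name.
print(jdbc_source["config"]["topic.prefix"] + "orders")  # jdbc-orders
```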
Amazon S3 Sink Connector
Simple Storage Service (S3) is an object storage service by Amazon. The S3 sink connector allows you to export data from Kafka to S3 objects. The format of the objects could be JSON, Avro, or bytes.
MongoDB Connector (Source and Sink)
With MongoDB’s connector, we can pull data out of Kafka and write it to MongoDB, which then operates as a data sink. Any changes that take place in MongoDB can also be published back to Kafka, making it work as a data source as well.
Scalyr Kafka Connector
The Scalyr Kafka Connector allows seamless integration with Kafka. With Scalyr’s connector, users can easily pull log data from Kafka. Then, they can push it into Scalyr and take advantage of Scalyr’s blazing-fast log management and observability software.
With Scalyr’s Event Data Cloud, which works as a Kafka consumer, we can handle massive volumes of event data ingested through Kafka. This helps minimize COGS (cost of goods sold) and increase gross margins while offering a frustration-free experience to users. That’s why having a reliable connector for streaming Kafka events to Scalyr is so important.
Kafka Connect Takes the Headache Out of Stream Processing
Kafka is becoming more and more popular and provides top-level stream processing. The Scalyr connector can send log data from an existing Kafka infrastructure to Scalyr. It’s easy to configure, taking advantage of the straightforward process of integrating Kafka with an external system. It’s easy to scale, and, lastly, it’s easy to troubleshoot and handle errors. Other features of the Scalyr Kafka Connector include support for
- Elastic Filebeat log messages, which automatically convert to Scalyr log events.
- Custom application log messages, where users convert message fields to Scalyr log event attributes.
- Fluentd and Fluent Bit, where custom application event mappings are used.
Plus, the Scalyr Kafka Connector prevents duplicate delivery by using the topic, partition, and offset to uniquely identify events.
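That deduplication idea is simple enough to sketch. The code below is an illustrative stdlib-only simulation, not the connector’s actual implementation: since a (topic, partition, offset) triple uniquely identifies a record in Kafka, a sink can skip any record it has already delivered, even if Kafka redelivers it after a retry.

```python
delivered = set()  # (topic, partition, offset) triples already written out

def deliver_once(topic, partition, offset, value, sink):
    """Append value to sink unless this exact record was delivered before."""
    key = (topic, partition, offset)
    if key in delivered:
        return False  # duplicate delivery: skip it
    delivered.add(key)
    sink.append(value)
    return True

sink = []
deliver_once("logs", 0, 41, "event A", sink)
deliver_once("logs", 0, 42, "event B", sink)
deliver_once("logs", 0, 42, "event B", sink)  # redelivered after a retry

print(sink)  # ['event A', 'event B'] (the duplicate was skipped)
```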
Apache Kafka is a powerful system, and it’s here to stay. The Kafka Connect framework removes the headaches of integrating data from external systems. And Scalyr’s development of an open-source connector makes sending logs from Kafka to Scalyr easier than ever.
Have you tried Scalyr yet? Visit the Scalyr demo page here to get started!
This post was written by Alex Doukas. Alex’s main area of expertise is web development and everything that comes along with it. He also has extensive knowledge of topics such as UX design, big data, social media marketing, and SEO techniques.