Google Pub/Sub Lite for Kafka Users

 


Pub/Sub Lite is a new service from the Cloud Pub/Sub team that aims to provide a more cost-effective alternative to Pub/Sub. In particular, it offers a managed service for users who would otherwise consider running their own single-zone Apache Kafka cluster for cost reasons. This post compares Pub/Sub Lite, Pub/Sub, and a self-managed Kafka configuration, and explains how to test your existing Kafka workloads on Pub/Sub Lite. As a partitioned log with progress tracked through increasing offsets, Pub/Sub Lite shares more high-level concepts with Kafka than Cloud Pub/Sub does. As a result, its API more closely resembles Kafka's, and users can publish and consume messages using the Kafka client APIs.

Notable differences between Pub/Sub Lite and Kafka

Although Pub/Sub Lite is conceptually similar to Apache Kafka, it is a separate system with APIs focused on data ingestion. The differences should not matter for stream ingestion and processing, but they are relevant in a few specific use cases.

Kafka as a database

Pub/Sub Lite does not support transactional publishing or log compaction, which are Kafka features that are more useful when Kafka is used as a database than as a messaging system. If this describes your use case, consider running your own Kafka cluster or using a managed Kafka solution such as Confluent Cloud. If neither of these options is viable, consider a horizontally scalable database such as Cloud Spanner, using row keys and commit timestamps to deduplicate the table.

Compatibility with Kafka Streams

Kafka Streams is a data processing system built on top of Kafka. Although it allows consumer clients to be injected, it requires access to all admin operations and uses Kafka's transactional database properties to store internal metadata, neither of which Pub/Sub Lite provides. Apache Beam is a comparable streaming data processing system that works with Kafka, Pub/Sub, Pub/Sub Lite, and other data sources and sinks. Dataflow can also be used to run Beam pipelines in a fully managed manner.

Monitoring

Kafka clients have the ability to read server-side metrics. In Pub/Sub Lite, metrics relevant to publisher and subscriber behavior are exposed through Cloud Monitoring with no additional configuration.

Administration and Configuration

Capacity Management

The capacity of a Kafka topic is generally determined by the capacity of the cluster. Replication, key compaction, and batching settings will determine the capacity required to service any given topic on the Kafka cluster, but there are no direct throughput limits on a topic itself. By contrast, both storage and throughput capacity for a Pub/Sub Lite topic must be defined explicitly. Pub/Sub Lite topic capacity is determined by the number of partitions and adjustable read, write and storage capacity of each partition.

Authentication and Security

Apache Kafka supports several open authentication and encryption mechanisms. With Pub/Sub Lite, authentication is based on GCP’s IAM system. Security is assured through encryption at rest and in transit.

Configuration options

Kafka has a large number of configuration options that control topic structure, limits and broker properties. Some common ones useful for data ingestion are presented below, with their equivalents in Pub/Sub Lite. Note that because Pub/Sub Lite is a managed system, users do not need to be concerned with many broker properties.

auto.create.topics.enable

No equivalent is available. Topics should be created beforehand using the admin API. Similarly, subscriptions (roughly equivalent to consumer groups) must be created using the admin API before they are used.

retention.bytes

The equivalent in Pub/Sub Lite is Storage per partition, a per-topic property.

retention.ms

The equivalent in Pub/Sub Lite is Message retention period, a per-topic property.

flush.ms, acks

These are not configurable, but publishes will not be acknowledged until they are guaranteed to be persisted to replicated storage.

max.message.bytes

This is not configurable; 3.5 MiB is the maximum message size that can be sent to Pub/Sub Lite. Message sizes are calculated in a repeatable manner.

key.serializer, value.serializer, key.deserializer, value.deserializer

Pub/Sub Lite specifically implements Producer<byte[], byte[]> and Consumer<byte[], byte[]>. Any serialization (which can possibly fail) should be performed by user code.
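As a minimal sketch of what "performed by user code" can look like, the helper below converts between String and byte[] before a value is handed to a byte[]-based producer or after it is read from a byte[]-based consumer. The class name Utf8Codec and its methods are hypothetical and are not part of any Pub/Sub Lite or Kafka library; UTF-8 is assumed purely for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: (de)serialization done in user code, outside the client. */
final class Utf8Codec {
    private Utf8Codec() {}

    /** Encodes a string as UTF-8 bytes, suitable for use as a byte[] key or value. */
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Decodes UTF-8 bytes strictly: malformed input raises an exception
     * instead of being silently replaced, so the caller decides how to handle it.
     */
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8
            .newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] payload = encode("order-created");
        System.out.println(decode(payload));
    }
}
```

Keeping the failure path in your own code means a bad payload can be logged or skipped without involving the client at all.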

retries

Pub/Sub Lite uses a streaming wire protocol and will retry transient publish errors such as unavailability indefinitely. Any failures which reach end-user code are permanent.

batch.size

Batching settings are configurable at client creation time.

message.timestamp.type

When using the Consumer implementation, the event timestamp will be chosen if present, and the publish timestamp used otherwise. Both publish and event timestamps are available when using Dataflow.
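The selection rule described above can be sketched in a few lines. This is only an illustration of the rule, not the client's actual code; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical illustration of the timestamp rule; not the client's implementation. */
final class TimestampSelection {
    private TimestampSelection() {}

    /** Returns the event timestamp when one is present, otherwise the publish timestamp. */
    static Instant recordTimestamp(Optional<Instant> eventTime, Instant publishTime) {
        return eventTime.orElse(publishTime);
    }

    public static void main(String[] args) {
        Instant publish = Instant.parse("2021-01-01T00:00:10Z");
        Instant event = Instant.parse("2021-01-01T00:00:00Z");
        // The event time wins when the publisher set one.
        System.out.println(recordTimestamp(Optional.of(event), publish));
        // Without an event time, the publish time is used.
        System.out.println(recordTimestamp(Optional.empty(), publish));
    }
}
```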

max.partition.fetch.bytes, max.poll.records

Flow control settings are configurable at client creation time.

enable.auto.commit

Autocommit is configurable at client creation time.

enable.idempotence

This is not currently supported.

auto.offset.reset

This is not currently supported.

Getting started with Pub/Sub Lite

Pub/Sub Lite tooling makes it easy to try out your current Kafka workloads on Pub/Sub Lite. If you have multiple tasks in a consumer group reading from a multi-producer Kafka topic, adapting your code to run with Pub/Sub Lite requires only minimal changes. These are outlined below.

Create Pub/Sub Lite resources

To ingest and process data with Pub/Sub Lite, you need to create a topic and a subscription, respectively. When creating your topic, ensure that it has enough horizontal parallelism (partitions) to handle your peak publish load. If your peak publish throughput is X MiB/s, you should provision X/4 partitions for your topic, with 4 MiB/s of capacity each (the default).
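As a rough sketch of this sizing rule, the helper below divides the peak publish throughput by the per-partition capacity. Rounding a fractional result up and provisioning at least one partition are assumptions added here, since the rule above only gives X/4. The class and method names are hypothetical.

```java
/** Hypothetical sizing helper; not part of any Pub/Sub Lite library. */
final class PartitionSizing {
    /** Default per-partition publish capacity quoted in this post, in MiB/s. */
    static final double DEFAULT_PUBLISH_MIB_PER_SEC = 4.0;

    private PartitionSizing() {}

    /**
     * Returns the number of partitions needed to absorb a peak publish rate.
     * Assumption: fractional results are rounded up, with a minimum of one partition.
     */
    static int partitionsFor(double peakPublishMiBPerSec, double perPartitionMiBPerSec) {
        if (peakPublishMiBPerSec < 0 || perPartitionMiBPerSec <= 0) {
            throw new IllegalArgumentException("throughput values must be positive");
        }
        return Math.max(1, (int) Math.ceil(peakPublishMiBPerSec / perPartitionMiBPerSec));
    }

    public static void main(String[] args) {
        // A 10 MiB/s peak at 4 MiB/s per partition needs 2.5, rounded up to 3.
        System.out.println(partitionsFor(10.0, DEFAULT_PUBLISH_MIB_PER_SEC)); // prints 3
    }
}
```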

Copy data from Kafka

A Kafka Connect connector for Pub/Sub Lite is maintained by the Pub/Sub team, and is the easiest way to copy data to Pub/Sub Lite. For experimentation, you can simply run the copy tool script, which will download and run Kafka Connect locally in a single-machine configuration. Ensure that you follow the Pre-Running steps to properly configure authentication before starting. While it is running, the connector will mirror all data that is published to your Kafka topic to Pub/Sub Lite. The Kafka Connect documentation provides more information on how to run a Kafka Connect job for your cluster. An example properties file would look like this:

name=PubSubLiteSourceConnector

connector.class=com.google.pubsublite.kafka.source.PubSubLiteSourceConnector

pubsublite.project=my-project

pubsublite.location=europe-south7-q

pubsublite.subscription=my-subscription

kafka.topic=my-kafka-topic

Once you start copying data, you should be able to see the Pub/Sub Lite topic’s topic/publish_message_count metric growing in the metrics explorer console, as the backlog of your Kafka topic is copied over.

Read data from Pub/Sub Lite

The Pub/Sub team maintains a Kafka Consumer API implementation that allows you to read data from Pub/Sub Lite with only minimal modifications to your existing code.

To do so, you will replace all instances of KafkaConsumer<byte[],byte[]> with a Pub/Sub Lite-specific implementation of the same interface. First, ensure that no client code references the concrete KafkaConsumer implementation; replace any such references with the Consumer<byte[],byte[]> interface. Next, construct your Pub/Sub Lite Consumer implementation as detailed in the link above, and pass it through to your code.

When you call poll(), you will now be retrieving messages from Pub/Sub Lite instead of Kafka. Note that the Pub/Sub Lite Consumer will not automatically create a subscription for you: you must create a subscription beforehand using either the UI or gcloud.

As you receive messages and commit offsets, you can monitor the progress of your consumers through the backlog by looking at the subscription/backlog_message_count metric in the metrics explorer console.

Write data to Pub/Sub Lite

Once all Consumers have been migrated to reading data from Pub/Sub Lite, you can begin migrating Producers to Pub/Sub Lite. As in the Consumer case, you can replace all uses of KafkaProducer<byte[], byte[]> with Producer<byte[],byte[]> as a no-op change. Then, following the instructions, you can construct a Pub/Sub Lite Producer implementation and pass it to your code. When you call send(), data will be sent to Pub/Sub Lite instead. When you update your producer jobs, your consumers reading from Pub/Sub Lite will not care whether the data is sent through Kafka (and copied to Pub/Sub Lite by Kafka Connect) or to Pub/Sub Lite directly. It is not an issue to have Kafka and Pub/Sub Lite producers running at the same time.

Reference: Google Pub/Sub Lite for Kafka Users | by Daniel Collins | Google Cloud - Community | Medium
