Zalando Tech Blog – Technical Articles Eugen Kiss

Removing the burden of state management

State management and change propagation are arguably some of the hardest challenges in GUI programming. Many tools promised to save us from their burden. Only a few remained. Among them is MobX and its flavor of Transparent Reactive Programming.

To understand the appeal of MobX, it is helpful to first understand how React revolutionized GUI programming. Traditional approaches allow description of the initial state of the GUI. Further GUI state transitions must be accomplished with references to GUI elements and piece-wise mutations. This is error-prone as edge cases are easily missed. With React you describe the GUI at any given point in time. Put differently, taking care of GUI state transitions, e.g. manipulating the DOM, is a thing of the past: Your GUI code has become declarative.

React’s crucial advantage is making the “how” of updating the GUI transparent. That idea is reapplied by MobX. Not for GUI manipulation code, but instead for state management and change propagation. In fact, combining both React and MobX is synergistic since even though React nicely takes care of how to update the GUI, when to update the GUI remains cumbersome without MobX. Cross communication between components is the biggest pain point.

Let us explore an example in React and JavaScript illustrating MobX’s advantages:

import { observable } from 'mobx'
import { observer } from 'mobx-react'

const store = observable({
  itemCount: 0,
  lastItem: null
})

const handleBuy = (name) => () => {
  store.itemCount += 1
  store.lastItem = name
}

const handleClearCart = () => {
  store.itemCount = 0
  store.lastItem = null
}

const Cart = observer(() =>
  <div>
    <span>: {store.itemCount}</span>
    <button onClick={handleClearCart}>Clear cart</button>
  </div>
)

const Header = observer(({children}) =>
  <div>{children}</div>
)

const LastBought = observer(() =>
  <span>Last bought item: {store.lastItem}.</span>
)

const Main = observer(() =>
  <div>
    <Header>
      <Cart />
    </Header>
    <LastBought />
    <button onClick={handleBuy("shoe")}>Buy shoes</button>
    <button onClick={handleBuy("shirt")}>Buy shirt</button>
  </div>
)

The example represents a very simplified e-commerce page. Any resemblance to Zalando’s fashion store is, of course, coincidental. Here is a demo and its live-editable source code. The behavior is as follows: Initially, your cart is empty. When you click on “Buy shoe” your cart item count increases by one and the recently bought component shows “shoe”. A click on “Buy shirt” does the same for “shirt”. You can also clear your cart. In the live example you will see that only the cart in the header is re-rendered but not the header itself. You will also see that the recently bought component does not rerender when you buy the same product in succession.

Notice that the dependency between the observable variables itemCount and lastItem, and the components is not explicitly specified. Yet, the components correctly and efficiently rerender on changes. You may wonder how this is accomplished. The answer is that MobX implicitly builds up a dependency graph during execution of the components’ render functions that tracks which components to rerender when an observable variable changes. A way to think of MobX is in terms of a spreadsheet where your components are formulas of observable variables. Regardless of how the “magic” works underneath, and which analogy to use, the result is clear: You are freed from the burden of explicitly managing change propagation!

To sum up, MobX is a pragmatic, non-ceremonial, and efficient solution to the challenge of state management and change propagation. It works by building up a run-time dependency graph between observable variables and components. Using both React and MobX together is synergistic. Currently, MobX is used in just a few projects inside Zalando. Seeing as MobX offers a great deal to anyone writing GUI programs, I am confident that its adoption will rise in the future.

Keep in touch with fresh Zalando Tech news. Follow us on Twitter.

Zalando Tech Blog – Technical Articles Han Xiao

How We Built the Next Generation Product Search from Scratch using a Deep Neural Network

Product search is one of the key components in an online retail store. A good product search can understand a user’s query in any language, retrieve as many relevant products as possible, and finally present the results as a list in which the preferred products should be at the top, and the less relevant products should be at the bottom.

Unlike text retrieval (e.g. Google web search), products are structured data. A product is often described by a list of key-value pairs, a set of pictures and some free text. In the developers’ world, Apache Solr and Elasticsearch are known as de-facto solutions for full-text search, making them top contenders for building e-commerce product search.

At the core, Solr/Elasticsearch is a symbolic information retrieval (IR) system. Mapping queries and documents to a common string space is crucial to the search quality. This mapping process is an NLP pipeline implemented with Lucene Analyzer. In this post, I will reveal some drawbacks of such a symbolic-pipeline approach, and then present an end-to-end way of building a product search system from query logs using Tensorflow. This deep learning based system is less prone to spelling errors, leverages underlying semantics better, and scales out to multiple languages much easier.

Recap: Symbolic Approach for Product Search

Let’s first do a short review of the classic approach. Typically, an information retrieval system can be divided into three tasks: indexing, parsing and matching. As an example, the next figure illustrates a simple product search system:

  1. indexing: storing products in a database with attributes as keys, e.g. brand, color, category;
  2. parsing: extracting attribute terms from the input query, e.g. red shirt -> {"color": "red", "category": "shirt"};
  3. matching: filtering the product database by attributes.

Many existing solutions such as Apache Solr and Elasticsearch follow this simple idea. Note that, at their core, they are symbolic IR systems that rely on NLP pipelines for getting an effective string representation of the query and product.
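To make those three steps concrete, here is a deliberately toy sketch in Scala (the attribute names, the hand-written lexicon and the in-memory catalog are illustrative assumptions, not how Solr/Elasticsearch actually works):

final case class ProductDoc(name: String, attributes: Map[String, String])

// parsing: map query terms onto attribute key-value pairs via a tiny hand-written lexicon
def parse(query: String): Map[String, String] =
  query.toLowerCase.split("\\s+").toList.collect {
    case "red"   => "color" -> "red"
    case "shirt" => "category" -> "shirt"
  }.toMap

// matching: filter the indexed products by the extracted attributes
def matching(index: Seq[ProductDoc], extracted: Map[String, String]): Seq[ProductDoc] =
  index.filter(p => extracted.forall { case (k, v) => p.attributes.get(k).contains(v) })

// parse("red shirt") == Map("color" -> "red", "category" -> "shirt")

The pain points below are about everything this sketch hides: in a real system the parsing step alone grows into a full NLP pipeline.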

Pain points of A Symbolic IR System

1. The NLP pipeline is fragile and doesn’t scale out to multiple languages
The NLP Pipeline in Solr/Elasticsearch is based on the Lucene Analyzer class. A simple analyzer such as StandardAnalyzer would just split the sequence by whitespace and remove some stopwords. Quite often you have to extend it by adding more and more functionalities, which eventually results in a pipeline as illustrated in the figure below.

While it looks legit, my experience is that such NLP pipelines suffer from the following drawbacks:

  • The system is fragile. As the output of every component is the input of the next, a defect in the upstream component can easily break down the whole system. For example, canyourtoken izer split thiscorrectly?
  • Dependencies between components can be complicated. A component can take from and output to multiple components, forming a directed acyclic graph. Consequently, you may have to introduce some asynchronous mechanisms to reduce the overall blocking time.
  • It is not straightforward to improve the overall search quality. An improvement in one or two components does not necessarily improve the end-user search experience.
  • The system doesn’t scale out to multiple languages. To enable cross-lingual search, developers have to rewrite those language-dependent components in the pipeline for every language, which increases the maintenance cost.

2. Symbolic Systems do not Understand Semantics without Hard Coding
A good IR system should understand that trainer means sneaker by using some semantic knowledge. No one likes hard coding this knowledge, especially machine learning people. Unfortunately, it is difficult for Solr/Elasticsearch to understand any acronym/synonym unless you implement a SynonymFilter class, which is basically a rule-based filter. This severely restricts the generalizability and scalability of the system, as you need someone to maintain a hard-coded, language-dependent lexicon. If one can represent query/product by a vector in a space learned from actual data, then synonyms and acronyms could easily be found in the neighborhood without hard coding.

Neural IR System
The next figure illustrates a neural information retrieval framework, which looks pretty much the same as its symbolic counterpart, except that the NLP pipeline is replaced by a deep neural network and the matching job is done in a learned common space.

End-to-End Model Training
There are several ways to train a neural IR system. One of the most straightforward (but not necessarily the most effective) ways is end-to-end learning. Namely, your training data is a set of query-product pairs feeding on the top-right and top-left blocks in the last figure. All the other blocks are learned from data. Depending on the engineering requirements or resource limitations, one can also fix or pre-train some of the components.

Where Do Query-Product Pairs Come From?
To train a neural IR system in an end-to-end manner, you need some associations between query and product such as the query log. This log should contain what products a user interacted with after typing a query. Typically, you can fetch this information from the query/event log of your system. After some work on segmenting, cleaning and aggregating, you can get pretty accurate associations. In fact, any user-generated text can be good association data. This includes comments, product reviews, and crowdsourcing annotations.

Neural Network Architecture
The next figure illustrates the architecture of the neural network. The proposed architecture is composed of multiple encoders, a metric layer, and a loss layer. First, input data is fed to the encoders which generate vector representations. In the metric layer, we compute the similarity of a query vector with an image vector and an attribute vector, respectively. Finally, in the loss layer, we compute the difference of similarities between positive and negative pairs, which is used as the feedback to train encoders via backpropagation.

Query Encoder
Here we need a model that takes in a sequence and outputs a vector. Besides the content of a sequence, the vector representation should also encode language information and be resilient to misspellings. The character-RNN (e.g. LSTM, GRU, SRU) model is a good choice. By feeding RNN character by character, the model becomes resilient to misspelling such as adding/deleting/replacing characters. The misspelled queries would result in a similar vector representation as the genuine one. Moreover, as European languages (e.g. German and English) share some Unicode characters, one can train queries from different languages in one RNN model. To distinguish the words with the same spelling but different meanings in two languages, such as German rot (color red) and English rot, one can prepend a special character to indicate the language of the sequence, e.g. 🇩🇪 rot and 🇬🇧 rot.

Image Encoder
The image encoder rests on purely visual information. The RGB image data of a product is fed into a multi-layer convolutional neural network based on the ResNet architecture, resulting in a 128-dimensional image vector representation.

Attribute Encoder
The attributes of a product can be combined into a sparse one-hot encoded vector. It is then supplied to a four-layer, fully connected deep neural network with steadily diminishing layer size. Standard ReLUs provide the nonlinear activations, and dropout is applied to address overfitting. The output is a 20-dimensional attribute vector representation.

Metric & Loss Layer
After a query-product pair goes through all three encoders, one can obtain a vector representation of the query, an image representation and an attribute representation of the product. It is now time to squeeze them into a common latent space. In the metric layer, we need a similarity function which gives a higher value to the positive pair than the negative pair. To understand how a similarity function works, I strongly recommend you read my other blog post on “Optimizing Contrastive/Rank/Triplet Loss in Tensorflow for Neural Information Retrieval”. It also explains the metric and loss layer implementation in detail.
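The exact formulation is spelled out in that post; as a rough sketch, a standard triplet-style ranking loss over a query $q$, a matching product $p^{+}$ and a non-matching product $p^{-}$ (the similarity $s$ and margin $m$ here are placeholders, not necessarily what the production system uses) looks like:

$$\mathcal{L}(q, p^{+}, p^{-}) = \max\bigl(0,\; m - s(q, p^{+}) + s(q, p^{-})\bigr)$$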

For a neural IR system, doing inference means serving search requests from users. Since products are updated regularly (say once a day), we can pre-compute the image representation and attribute representation for all products and store them. During the inference time, we first represent user input as a vector using query encoder; then iterate over all available products and compute the metric between the query vector and each of them; finally, sort the results. Depending on the stock size, the metric computation part could take a while. Fortunately, this process can be easily parallelized.
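A rough sketch of that serving loop in Scala (the brute-force scan, the dot-product similarity and the way the two similarities are combined are simplifying assumptions, not the actual implementation):

final case class ProductVectors(productId: String, image: Vector[Double], attrs: Vector[Double])

def dot(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// encodeQuery stands in for the trained query encoder
def rank(query: String,
         catalog: Seq[ProductVectors],
         encodeQuery: String => Vector[Double]): Seq[(String, Double)] = {
  val q = encodeQuery(query)
  catalog
    .map(p => p.productId -> (dot(q, p.image) + dot(q, p.attrs)))  // score every pre-computed product
    .sortBy { case (_, score) => -score }                          // best matches first
}

The per-product scoring inside the map is exactly the part that parallelizes trivially across cores or machines.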

Qualitative Results
Here, I demonstrate some (cherry-picked) results for different types of queries. It seems that the system is heading in the right direction. It is exciting to see that the neural IR system is able to correctly interpret named entities, spelling errors and multilinguality without any NLP pipeline or hard-coded rules. However, one can also notice that some top ranked products are not relevant to the query, which leaves quite some room for improvement.

Speed-wise, the inference time is about two seconds per query on a quad-core CPU for 300,000 products. One can further improve the efficiency by using model compression techniques.

Query & Top-20 Results

🇩🇪 nike

🇩🇪 schwarz (black)

🇩🇪 nike schwarz

🇩🇪 nike schwarz shirts

🇩🇪 nike schwarz shirts langarm (long-sleeved)

🇬🇧 addidsa (misspelled brand)

🇬🇧 addidsa trosers (misspelled brand and category)

🇬🇧 addidsa trosers blue shorrt (misspelled brand and category and property)

🇬🇧 striped shirts woman

🇬🇧 striped shirts man

🇩🇪 kleider (dress)

🇩🇪 🇬🇧 kleider flowers (mix-language)

🇩🇪 🇬🇧 kleid ofshoulder (mix-language & misspelled off-shoulder)

If you are a search developer who is building a symbolic IR system with Solr/Elasticsearch/Lucene, this post should make you aware of the drawbacks of such a system.

This post should also answer your What?, Why? and How? questions regarding a neural IR system. Compared to the symbolic counterpart, the new system is more resilient to the input noise and requires little domain knowledge about the products and languages. Nonetheless, one should not take it as a “Team Symbol” or “Team Neural” kind of choice. Both systems have their own advantages and can complement each other pretty well. A better solution would be combining these two systems in a way that we can enjoy all advantages from both sides.

Some implementation details and tricks are omitted here but can be found in my other posts. I strongly recommend readers to continue with the following posts:

Last but not least, the open-source project MatchZoo contains many state-of-the-art neural IR algorithms. In addition to product search, one may find applications in conversational chatbots and question-answering systems.

To work with great people like Han, have a look at our jobs page.

Zalando Tech Blog – Technical Articles Javier Arrieta

Leveraging the full power of a functional programming language

In Zalando Dublin, you will find that most engineering teams are writing their applications using Scala. We will try to explain why that is the case and the reasons we love Scala.

This content comes both from my own experience and from that of the team I'm working with in building the new Zalando Customer Data Platform.

How I came to use Scala

I have been working with JVM for the last 18 years. I find there is a lot of good work making the Java Virtual Machine very efficient and very fast, utilizing the underlying infrastructure well.

I feel comfortable debugging complex issues, such as identifying those caused by garbage collection, and improving our code to alleviate the pauses (see Martin Thompson’s blog post or Aleksey Shipilёv’s JVM Anatomy Park).

I liked Java. I didn’t mind the boilerplate code too much if it didn’t get in the way of expressing the intent of the code. However, what bugged me was the amount of code required to encourage immutability and not having lambdas to transform collections.

At the end of 2012, I had to design a service whose only mission was to back up files from customer mobile devices (think a cloud backup service). It was a simple enough service, accepting bytes from the customer device (using a REST API) and writing them to disk. We were using the Servlet API, and the system was working well. However, as the devices were mobile phones and the upload bandwidth wasn’t very high, the machines were mostly idle, waiting for the buffers to fill up. Unfortunately, we couldn't scale up. When the system reached a few hundred workers, it would start to quickly degrade due to excessive context switching.

We were using Netty in other components, but the programming model of callbacks wasn’t something I wanted to introduce, as it becomes very complex very quickly to compose the callbacks.

Introducing Scala

I had been looking at Scala for some years and started to look into async HTTP frameworks. I liked spray because it allowed us to use it at a high level or low level depending on our requirements. It also wasn’t a framework that forced us into adapting everything to it. I created a quick proof of concept and was amazed at the conciseness of the code and how efficient it was (spray is optimised for throughput, not latency), being able to handle thousands of concurrent uploads with a single core. Previously we were bound by CPU because of all the context switching, but with spray, we managed to overcome this and become limited mostly by IO.

From that point on I decided I wanted to learn Scala and Functional Programming. I finished the Coursera Functional Programming in Scala course, started writing my side projects in Scala and tried to find a position in a company that worked with Scala.

I evaluated other FP languages like Clojure, but I like strongly typed languages as my experience is that the systems written with them are easier to maintain in the long term. I also looked at Haskell, but I felt more confident with a JVM compiled language that could use all the existing Java libraries.

The first thing I fell in love with was Monad composition to define the program (or subprogram) as a series of stages composed in a for-comprehension. It is a very convenient way to model asynchronous computations using Future as a Monad (I know that Future is not strictly a Monad, but from our code's point of view we can assume it is).

We will see some examples below, starting with this snippet:

for {
  customer <- findCustomer(address)
  segment  <- getCustomerSegment(customer.id)
  email     = promotionEmail(customer, segment)
  result   <- sendEmail(address, email)
} yield result

It becomes natural and straightforward to define the different stages of a computation as a pipeline in a for-comprehension, layering your program into different components, each responsible for its own steps.

We could also run them in parallel using an Applicative instead of a Monad.

Now in Zalando

Zalando is a big company, currently with over 1,900 engineers working here. As we have mentioned in previous blog posts, we are empowered to use the technologies we choose to build our systems, so the teams pick the language, libraries, components and tools. As you can see on our public Tech Radar, Scala is one of our core languages, with several Scala libraries like AKKA and Play!.

So I cannot speak for how all Zalando teams are working with Scala. Some teams are deep into the Functional Programming side while others are using the language mostly as a “better Java”, adopting lambdas, case classes and pattern matching to make the code more concise and understandable.

But I can talk about how people are using the language in the Dublin office, where the services and data pipelines are written mostly in Scala: how our team developing the Customer Data Platform is using Scala, what libraries we are using and what we like about the language.

Things we love about Scala


We love types. Types help us understand what we are dealing with. String  or Int can often be meaningless; we don’t want to mix a Customer Password with a Customer Name,  Email Address, etc. We want to know what a given value is.

For this we are currently using two different approaches:

  • Tagged types: Using shapeless @@ we decorate the primitive type with the tag we want to attach.
  • Value classes: Using a single attribute case class that extends AnyVal, so that the compiler tries to remove the boxing/unboxing whenever it can. This is useful when we want to override toString; for example, we may want to redact sensitive customer data when it goes into logs.

Here you have a simple example of both (full code here ):

import java.util.UUID
import shapeless.tag, tag.@@
import cats.data.Validated, Validated._

object model {

  final case class Password private (value: String) extends AnyVal {
    override def toString = "***"
  }

  object Password {
    def create(s: String): Validated[String, Password] = s match {
      case candidate if candidate.isEmpty || candidate.length < 8 => Invalid("Minimum password length has to be 8")
      case valid => Valid(Password(valid))
    }
  }

  sealed trait UserIdTag
  type UserId = UUID @@ UserIdTag
  object UserId {
    def apply(v: UUID) = tag[UserIdTag](v)
  }

  sealed trait EmailAddressTag
  type EmailAddress = String @@ EmailAddressTag

  object EmailAddress {
    def apply(s: String): Validated[String, EmailAddress] = s match {
      case invalid if !invalid.contains("@") => Invalid(s"$invalid is not a valid email address")
      case valid => val tagged = tag[EmailAddressTag](valid); Valid(tagged)
    }
  }
}

Function Composition


One of my favourite features of Scala is how easy and elegant it is to compose functions to create more complex ones. The most common way of doing this is using Monads inside a for-comprehension.

This way we can run several operations sequentially and obtain a result. As soon as any of the operations fail, the comprehension will exit with that failure.

def promotionEmail(customer: Customer, segment: CustomerSegment): Email = ???

def sendEmail(address: EmailAddress, message: Email): Future[Unit] = ???

def findCustomer(address: EmailAddress): Future[Customer] = ???

def getCustomerSegment(id: CustomerId): Future[CustomerSegment] = ???

def sendPromotionalEmail(address: EmailAddress)(implicit ec: ExecutionContext): Future[Unit] = {
  for {
    customer <- findCustomer(address)
    segment  <- getCustomerSegment(customer.id)
    email     = promotionEmail(customer, segment)
    result   <- sendEmail(address, email)
  } yield result
}

For a full example see here.

If what you want to do is evaluate several functions in parallel and collect all the errors or the successful results, you can use an Applicative Functor. This is very common when doing validations of a complex entity, where we can present all the detected errors in one go to the client.

type ValidatedNel[A] = Validated[NonEmptyList[String], A]

final case class Customer(name: Name, email: EmailAddress, password: Password)

object Customer {
  // Apply comes from cats; Name, EmailAddress and Password each validate their input here
  def apply(name: String, email: String, password: String): ValidatedNel[Customer] =
    Apply[ValidatedNel].map3(Name(name), EmailAddress(email), Password(password))(Customer.apply)
}

For a full example see here.

Another combination is a simple function composition using compose or andThen, or if the arguments don’t match completely, using anonymous functions to combine them.

final case class Customer(id: CustomerId, address: EmailAddress, name: Name)
val findCustomer: EmailAddress => Customer
val sendEmail: Customer => Either[String, Unit]
val sendCustomerEmail: EmailAddress => Either[String, Unit] = findCustomer andThen sendEmail

For a full example see here.

Referential Transparency

We like being able to reason about a computation by using the substitution model, i.e., in a referentially transparent computation you can always substitute a function call with the result of executing that function with those parameters.

This enormously simplifies the understanding of a complex system: you understand the components (functions) that together compose the system.

def sq(x: Int): Int = x * x
assert(sq(5) == 5 * 5)

The previous example might be too basic, but I hope it suffices to make the point. You can always replace calling the sq function with the result, and there is no difference between both. This is also very helpful when testing your program. For more detail, you can look at the Wikipedia article here.

One of the most common issues that stop people from using referential transparency, apart from global mutable state, is the ability to blow up the stack by throwing exceptions. Throwing exceptions across the call stack can be seen as very powerful. However, it makes it difficult to reason about your program when composing functions.

Monad Transformers

One of the caveats of using effects (effects are orthogonal to the type you get, for instance Future is the effect of asynchrony, Option is the effect of optionality, Iterable of repeatability…) is that they become extremely cumbersome when composing more than two operations, or when composing and nesting.

One of the most popular solutions is using a Monad Transformer that allows us to stack two Monads in one and use them as if they were a standard Monad. You can visit this blog post for more detail.

Another option, which we are not going to explore in this post, is to use extensible effects; you can visit eff for more detail.

The reader can try this without using a Monad Transformer to see how complex it becomes, even if only composing two functions.

import scala.concurrent.{ExecutionContext, Future}
import cats.data.EitherT
import cats.instances.future._

final case class Customer(id: CustomerId, email: EmailAddress, fullName: String)

def findCustomer(email: EmailAddress): EitherT[Future, Throwable, Customer] =
  EitherT[Future, Throwable, Customer](Future.successful(Right(Customer("id", "", "John Doe"))))

def sendEmail(recipient: EmailAddress,
              subject: String,
              content: String): EitherT[Future, Throwable, Unit] =
  EitherT[Future, Throwable, Unit](Future.successful {
    Right(println(s"Sending promotional email to $recipient, subject: '$subject', content: '$content'"))
  })

def promotionSubject(fullName: String): String = s"Amazing promotion $fullName, only for you"

def promotionContent(fullName: String): String = s"Click this link for your personalised promotion $fullName..."

def sendPromotionEmailToCustomer(email: EmailAddress)(implicit ec: ExecutionContext): EitherT[Future, Throwable, Unit] = {
  for {
    customer <- findCustomer(email)
    subject   = promotionSubject(customer.fullName)
    content   = promotionContent(customer.fullName)
    result   <- sendEmail(customer.email, subject, content)
  } yield result
}

For full example see here


Typeclasses

We like decoupling structure from behaviour. Some structures might encompass some business functionality, but we should not try, as we did in OO, to define everything we think a class can do as methods.

For this, we adopt the Typeclass pattern where we define laws for a given behaviour and then implement the particular functionality for a given class. Classic examples are Semigroup, Monoid, Applicative, Functor, Monad, etc.

As a simple example, we wanted to be able to serialise/deserialise data types on the wire. For this we defined two simple Typeclasses:

trait BytesEncoder[A] {
  def apply(a: A): Array[Byte]
}

trait BytesDecoder[A] {
  def apply(arr: Array[Byte]): Either[Throwable, A]
}

All we need to do for every type we want to serialise is implement those interfaces. When we need to use an A serialiser, we will require an implicit BytesEncoder[A], and we will have to provide that implicit instance wherever it is used.
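On the requiring side this is just an implicit parameter; a minimal sketch (the publish method and its transport are hypothetical, not an existing API):

def publish[A](topic: String, message: A)(implicit encoder: BytesEncoder[A]): Unit = {
  val bytes: Array[Byte] = encoder(message)  // the caller never sees how A is encoded
  // hand `bytes` (and `topic`) over to the transport layer here
}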

For example, if we have a CustomerEntity and we want to be able to write it to the wire using protobuf, we then provide:

implicit def customerEntityProtoEncoder: BytesEncoder[CustomerEntity] = new BytesEncoder[CustomerEntity] {
  override def apply(ce: CustomerEntity) = ProtoMapper.toProto(ce).toByteArray
}


We use Either extensively, so we need to fold both cases to get a uniform response from them. We can use a fold to merge both cases into a common response to the client, for instance an HttpResponse.

We also want to consolidate all the errors from a Validated[NonEmptyList[String], T] into a common response by folding the NonEmptyList into a String or another adequate type.
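As a small sketch of that idea with akka-http style types (the status codes and the string entities chosen here are just an assumption about how one might map the two cases):

import akka.http.scaladsl.model.{HttpResponse, StatusCodes}

def toResponse(result: Either[String, String]): HttpResponse =
  result.fold(
    error => HttpResponse(StatusCodes.BadRequest, entity = error),  // left: report the error
    body  => HttpResponse(StatusCodes.OK, entity = body)            // right: return the payload
  )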


As you can see we are very happy using Scala and finding our way into more advanced topics in the functional programming side of it. But there is nothing stopping you from implementing all these useful features step by step to get up to speed and experiment with the benefits of using an FP language.

Learn more about opportunities at Zalando by visiting our jobs page.

Zalando Tech Blog – Technical Articles Ian Duffy

Backing up Apache Kafka and Zookeeper to S3

What is Apache Kafka?

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It is horizontally scalable, fault-tolerant, and wicked fast. It runs in production in many companies.

Backups are important with any kind of data. Apache Kafka lowers the risk of data loss by replicating data across brokers. However, it is still necessary to have protection in place in the event of user error.

This post will demo and introduce tools that our small team of three at Zalando uses to backup and restore Apache Kafka and Zookeeper.

Backing up Apache Kafka

Getting started with Kafka Connect
Kafka Connect is a framework for connecting Kafka with external systems. Its purpose is to make it easy to add new systems to scalable and secure stream data pipelines.

By using the kafka-connect-s3 connector from Spredfast, backing up and restoring the contents of a topic to S3 becomes a trivial task.

Download the prerequisite

Check out the following repository:

$ git clone 

It contains a docker-compose file for bringing up Zookeeper, Kafka, and Kafka Connect locally.

Kafka Connect will load all jars put in the ./kafka-connect/jars directory. Go ahead and download the kafka-connect-s3.jar

$ wget "" -O kafka-connect-s3.jar

Bring up the stack
To boot the stack, use docker-compose up

Create some data
Using the Kafka command line utilities, create a topic and a console producer:

$ kafka-topics --zookeeper localhost:2181 --create --topic example-topic --replication-factor 1 --partitions 1
Created topic "example-topic".
$ kafka-console-producer --topic example-topic --broker-list localhost:9092
>hello world

Using a console consumer, confirm the data is successfully written:

$ kafka-console-consumer --topic example-topic --bootstrap-server localhost:9092 --from-beginning
hello world

Create a bucket on S3 to store the backups:

$ aws s3api create-bucket --create-bucket-configuration LocationConstraint=eu-west-1 --region eu-west-1 --bucket example-kafka-backup-bucket

Create a sink connector configuration for the topic and submit it to Kafka Connect:

$ cat << EOF > example-topic-backup-tasks.json
{
  "name": "example-topic-backup-tasks",
  "config": {
    "connector.class": "com.spredfast.kafka.connect.s3.sink.S3SinkConnector",
    "format.include.keys": "true",
    "topics": "example-topic",
    "tasks.max": "1",
    "format": "binary",
    "s3.bucket": "example-kafka-backup-bucket",
    "value.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "key.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "local.buffer.dir": "/tmp"
  }
}
EOF
$ curl -X POST -H "Content-Type: application/json" -H "Accept: application/json" -d @example-topic-backup-tasks.json /api/kafka-connect-1/connectors

(Check out the Spredfast documentation for more configuration options.)

After a few moments the backup task will begin. By listing the Kafka Consumer groups, one can identify the consumer group related to the backup task and query for its lag to determine if the backup is finished.

$ kafka-consumer-groups --bootstrap-server localhost:9092 --list
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
connect-example-topic-backup-tasks
$ kafka-consumer-groups --describe --bootstrap-server localhost:9092 --group connect-example-topic-backup-tasks

Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).

TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                       HOST                           CLIENT-ID
example-topic                  0          1               1               0          consumer-5-e95f5858-5c2e-4474-bab9-8edfa722db21   /                    consumer-5

The backup is completed when the lag reaches 0. On inspecting the S3 bucket, a folder of the raw backup data will be present.

Let’s destroy all of the containers and start fresh:

$ docker-compose rm -f -v
$ docker-compose up

Re-create the topic:

$ kafka-topics --zookeeper localhost:2181 --create --topic example-topic --replication-factor 1 --partitions 1
Created topic "example-topic".

Create a source with Kafka Connect:

$ cat << EOF > example-restore.json
{
  "name": "example-restore",
  "config": {
    "connector.class": "com.spredfast.kafka.connect.s3.source.S3SourceConnector",
    "tasks.max": "1",
    "topics": "example-topic",
    "s3.bucket": "example-kafka-backup-bucket",
    "key.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "value.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "format": "binary",
    "format.include.keys": "true"
  }
}
EOF
$ curl -X POST -H "Content-Type: application/json" -H "Accept: application/json" -d @example-restore.json /api/kafka-connect-1/connectors

(Check out the Spredfast documentation for more configuration options.)

The restore process should begin, and after some time the ‘hello world’ message will display when you run the Kafka console consumer again:

$ kafka-console-consumer --topic example-topic --bootstrap-server localhost:9092 --from-beginning 
hello world


Backing Up Zookeeper

Zookeeper’s role
Newer versions of Kafka ( >= version 0.10.x) use ZooKeeper in a small but still important coordination role. When new Kafka Brokers join the cluster, they use ZooKeeper to discover and connect to the other brokers. The cluster also uses ZooKeeper to elect the controller and track the controller epoch. Finally and perhaps most importantly, ZooKeeper stores the Kafka Broker topic partition mappings, which tracks the information stored on each broker. The data will still persist without these mappings, but won't be accessible or replicated.

Exhibitor is a popular supervisor system for ZooKeeper. It provides a number of housekeeping facilities for ZooKeeper as well as exposing a nice UI for exploring the stored data. It also provides some backup and restore capabilities for ZooKeeper out of the box. However, we should make sure we understand these features before relying on them.

On AWS, we run our Exhibitor brokers under the same stack. In this setup, where the stack's auto scaling group is responsible for controlling when any of the Exhibitor instances are removed, it is relatively easy for multiple (or even all) Exhibitor brokers to be terminated at the same time. That’s why for our tests we set up an Exhibitor cluster and connected Kafka to it. We then indexed all the Kafka znodes to create an Exhibitor backup. Finally, we tore down the Exhibitor stack and re-deployed it with the same config.

Unfortunately, after re-deploy, while the backup folder was definitely in S3, the new Exhibitor appliance did not recognise it as an existing index. With a bit of searching we found that this is actually the expected behaviour and the suggested solution is to read the S3 index and apply changes by hand.

Creating, deleting and re-assigning topics in Kafka is an uncommon occurrence for us, so we estimated that a daily backup task would be sufficient for our needs.

We came across Burry. Burry is a small tool which allows for snapshotting and restoring of a number of system critical stores, including ZooKeeper. It can save the snapshot dump locally or to various cloud storage options. Its backup dump is also conveniently organized along the znode structure making it very easy to work with manually if need be. Using this tool we set up a daily cron job on our production to get a full daily ZooKeeper snapshot and upload the resultant dump to an S3 bucket on AWS.

Conveniently, Burry also works as a restore tool using a previous ZooKeeper snapshot. It will try to recreate the full snapshot znode structure and znode data. It also tries to be careful to preserve existing data, so if a znode exists it will not overwrite it.

But there is a catch. Some of the Kafka-created znodes are ephemeral and expected to expire when the Kafka Brokers disconnect. Currently Burry snapshots these as any other znodes, so restoring to a fresh Exhibitor cluster will recreate them. If we were to restore Zookeeper before restarting our Kafka brokers, we'd restore from the snapshot of the ephemeral znodes with information about the Kafka brokers in our previous cluster. If we then bring up our Kafka cluster, our new broker node IDs, which must be unique, would conflict with the IDs restored from our Zookeeper backup. In other words, we'd be unable to start up our Kafka brokers.

We can easily get around this problem by starting our new Zookeeper and new Kafka clusters before restoring the Zookeeper content from our backup. By doing this, the Kafka brokers will create their ephemeral znodes, and the Zookeeper restore will not overwrite these, and will go on to recreate the topic and partition assignment information. After restarting the Kafka brokers, the data stored on their persisted disks will once again be correctly mapped and available, and consumers will be able to resume processing from their last committed position.

Be part of Zalando Tech. We're hiring!

Zalando Tech Blog – Technical Articles Ian Duffy

A closer look at the ingredients needed for ultimate stability

This is part of a series of posts on Kafka. See Ranking Websites in Real-time with Apache Kafka’s Streams API for the first post in the series.

Remora is a small application for monitoring Kafka. With many teams deploying it to their production environments, open sourcing the application made sense. Here, I’ll go through some technical pieces of what streaming is, why it is useful, and the drive behind this useful monitoring application.

Streaming architectures have become an important architecture pattern at Zalando. To have fast, highly available and scalable systems to process data across numerous teams and layers, having a good streaming infrastructure and monitoring is key.  

Without streaming, the system is not reactive to changes. An older style would manage changes through incremental batches. Batches, such as CRONs, can be hard to manage: you have to keep track of a large number of jobs, each taking care of a shard of data, and this can also be hard to scale. Having a streaming component, such as Kafka, allows for a centralised scalable resource to make your architecture reactive.

Some teams use cloud infrastructure such as AWS Kinesis or SQS. Zalando chose Kafka to reach better throughput and to use frameworks like AKKA streams.

Monitoring lag is important; without it we don’t know where the consumer is relative to the size of the queue. An analogy might be piloting a plane without knowing how many more miles are left on your journey. Zalando has trialled:

We needed a simple independent application, which may scrape metrics from a URL and place them into our metrics system. But what to include in this application? I put on my apron and played chef with the range of ingredients available to us.

Scala is big at Zalando; a majority of the developers know AKKA or Play. At the time in our team, we were designing systems using AKKA HTTP with an actor pattern. It is a very light, stable and asynchronous framework. Could you just wrap the command line tools in a light Scala framework? Potentially, it could take less time and be more stable. Sounds reasonable, we thought, let’s do that.

The ingredients for ultimate stability were as follows: a handful of Kafka java command line tools with a pinch of AKKA Http and a hint of an actor design. Leave to code for a few days and take out of the oven. Lightly garnish with a performance test to ensure stability, throw in some docker. Deploy, monitor, alert, and top it off with a beautiful graph to impress. Present Remora to your friends so that everyone may have a piece of the cake, no matter where in the world you are!

Bon app-etit!

Zalando Tech Blog – Technical Articles Ian Duffy

Second in our series about the use of Apache Kafka’s Streams API by Zalando

This is the second in a series about the use of Apache Kafka’s Streams API by Zalando, Europe’s leading online fashion platform. See Ranking Websites in Real-time with Apache Kafka’s Streams API for the first post in the series.

This piece was first published on

Running Kafka Streams applications in AWS

At Zalando, Europe’s leading online fashion platform, we use Apache Kafka for a wide variety of use cases. In this blog post, we share our experiences and lessons learned to run our real-time applications built with Kafka’s Streams API in production on Amazon Web Services (AWS). Our team at Zalando was an early adopter of the Kafka Streams API. We have been using it since its initial release in Kafka 0.10.0 in mid-2016, so we hope you find this hands-on information helpful for running your own use cases in production.

What is Apache Kafka’s Streams API?

The Kafka Streams API is available as a Java library included in Apache Kafka that allows you to build real-time applications and microservices that process data from Kafka. It allows you to perform stateless operations such as filtering (where messages are processed independently from each other), as well as stateful operations like aggregations, joins, windowing, and more. Applications built with the Streams API are elastically scalable, distributed, and fault-tolerant. For example, the Streams API guarantees fault-tolerant data processing with exactly-once semantics, and it processes data based on event-time i.e., when the data was actually generated in the real world (rather than when it happens to be processed). This conveniently covers many of the production needs for mission-critical real-time applications.

An example of how we use Kafka Streams at Zalando is the aforementioned use case of ranking websites in real-time to understand fashion trends.

Library Upgrades of Kafka Streams

Largely due to our early adoption of Kafka Streams, we encountered many teething problems in running Streams applications in production. However, we stuck with it due to how easy it was to write Kafka Streams code. In our early days of adoption, we hit various issues around stream consumer groups rebalancing, issues with getting locks on the local RocksDB after a rebalance, and more. These eventually settled down and sorted themselves out in the release (April 2017) of the Kafka Streams API.


After upgrading to, our Kafka Streams applications were mostly stable, but we would still see what appeared to be random crashes every so often. These crashes occurred more frequently on components doing complex stream aggregations. We eventually discovered that the actual culprit was AWS rather than Kafka Streams: on AWS, General Purpose SSD (GP2) EBS volumes operate using I/O credits. The AWS pricing model allocates a baseline read and write IOPS allowance to a volume, based on the volume size. Each volume also has an IOPS burst balance, to act as a buffer if the base limit is exceeded. Burst balance replenishes over time but as it gets used up, reading and writing to disks starts getting throttled to the baseline, leaving the application with an EBS volume that is very unresponsive. This ended up being the root cause of most of our issues with aggregations. When running Kafka Streams applications, we had initially assigned 10GB disks as we didn’t foresee much storage occurring on these boxes. However, under the hood, the applications performed lots of read/write operations on the RocksDBs which resulted in I/O credits being used up, and given the size of our disks, the I/O credits were not replenished quickly enough, grinding our application to a halt. We remediated this issue by provisioning Kafka Streams applications with much larger disks. This gave us more baseline IOPS, and the burst balance was replenished at a faster rate.
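For reference, the baseline that the credit model tops up against is tied to volume size (these are the GP2 figures as published by AWS at the time, so treat them as approximate):

$$\text{baseline IOPS} = \max\bigl(100,\; 3 \times \text{volume size in GiB}\bigr), \qquad \text{burst ceiling} \approx 3000 \text{ IOPS}$$

This is why simply provisioning larger disks, even when the extra space is not needed, raises the floor below which throttling kicks in.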


EBS Burst Balance

Our monitoring solution polls CloudWatch metrics and pulls back all AWS exposed metrics. For the issue outlined, the most important of these is EBS burst balance. As mentioned above, in many cases applications that use Kafka Streams rely on heavy utilization of locally persisted RocksDBs for storage and quick data access. This storage is persisted on the instance’s EBS volumes and generates a high read and write workload on the volumes. GP2 disks were used in preference to provisioned IOPS disks (IO1) since these were found to be much more cost-effective in our case.

Fine-tuning your application

With upgrades in the underlying Kafka Streams library, the Kafka community introduced many improvements to the underlying stream configuration defaults. In previous, more unstable iterations of the client library, we had spent a lot of time tweaking configuration values to achieve some level of stability.

With new releases we found ourselves discarding these custom values and achieving better results. However, some timeout issues persisted on some of our services, where a service would frequently get stuck in a rebalancing state. We noticed that reducing the max.poll.records value for the stream configs would sometimes alleviate issues experienced by these services. From partition lag profiles we also saw that the consuming issue seemed to be confined to only a few partitions, while the others would continue processing normally between re-balances. Ultimately we realised that the processing time for a record in these services could be very long (up to minutes) in some edge cases. Kafka has a fairly large maximum offset commit time before a stream consumer is considered dead (five minutes), but with larger message batches of data, this timeout was still being exceeded. By the time the processing of the record was finished, the stream was already marked as failed and so the offset could not be committed. On rebalance, this same record would once again be fetched from Kafka, would fail to process in a timely manner and the situation would repeat. Therefore for any of the affected applications, we introduced a processing timeout, ensuring there was an upper bound on the time taken by any of our edge cases.


Consumer Lag

By looking at the metadata of a Kafka Consumer Group, we can determine a few key metrics. How many messages are being written to the partitions within the group? How many messages are being read from those partitions? The difference between these is called lag; it represents how far the Consumers lag behind the Producers.
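Per partition, this is just the distance between the end of the log and the group's committed offset; summed over all partitions of the group it gives the total lag:

$$\text{lag} = \sum_{p \,\in\, \text{partitions}} \bigl(\text{logEndOffset}_{p} - \text{committedOffset}_{p}\bigr)$$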

The ideal running state is that lag is a near zero value. At Zalando, we wanted a way to monitor and plot this to see if our streams applications are functioning.

After trying out a number of consumer lag monitoring utilities, such as Burrow, Kafka Lag Monitor and Kafka Manager, we ultimately found these tools either too unstable or a poor fit for our use case. From this need, our co-worker, Mark Kelly, built a small utility called Remora. It is a simple HTTP wrapper around the Kafka consumer group “describe” command. By polling the Remora HTTP endpoints from our monitoring system at a set time interval, we were able to get good insights into our stream applications.


Our final issue was due to memory consumption. Initially we somewhat naively assigned very large heaps to the Java virtual machine (JVM). This was a bad idea because Kafka Streams applications utilize a lot of off-heap memory when configured to use RocksDB as their local storage engine, which is the default. By assigning large heaps, there wasn’t much free system memory. As a result, applications would eventually come to a halt and crash. For our applications we use M4.large instances; we assign 4GB of RAM to the heap and usually utilize about 2GB of it, leaving the remaining 4GB of RAM free for off-heap and system usage, with overall system memory utilization at around 70%. Additionally, we would recommend reviewing the memory management section of Confluent’s Kafka Streams documentation as customising the RocksDB configuration was necessary in some of our use cases.


JVM Heap Utilization

We expose JVM heap utilization using Dropwizard metrics via HTTP. This is polled on an interval by our monitoring solution and graphed. Many of our applications are fairly memory intensive, with in-memory caching, so it was important for us to be able to see at a glance how much memory was available to the application. Additionally, due to the relative complexity of many of our applications, we wanted to have easy visibility into garbage collection in the systems. Dropwizard metrics offered a robust, ready-made solution for these problems.
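A minimal sketch of wiring that up with the Dropwizard metrics-jvm module (how the registry is then exposed over HTTP is left out, and the metric names are arbitrary):

import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.jvm.{GarbageCollectorMetricSet, MemoryUsageGaugeSet}

val registry = new MetricRegistry()
registry.register("jvm.memory", new MemoryUsageGaugeSet())    // heap and non-heap usage gauges
registry.register("jvm.gc", new GarbageCollectorMetricSet())  // per-collector counts and times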

CPU, System Memory Utilization and Disk Usage

We run the Prometheus node exporter on all of our servers; this exports lots of system metrics via HTTP. Again, our monitoring solution polls this on an interval and graphs them. While the JVM monitoring provided great insight into what was going on in the JVM, we also needed insight into what was going on in the instance. In general, most of the applications we ended up writing had much greater network and memory overheads than CPU requirements. However, in many failure cases we saw, our instances were ultimately terminated by auto-scaling groups on failing their health checks. These health checks would fail because the endpoints became unresponsive due to high CPU loads as other resources were used up. While this was usually not due to high CPU use in the application processing itself, it was a great symptom to capture and dig further into where this CPU usage was coming from. Disk monitoring also proved very valuable, particularly for Kafka Streams applications consuming from topics with a large partitioning factor and/or doing more complex aggregations. These applications store a fairly large amount of data (200MB per partition) in RocksDB on the host instance, so it is very easy to accidentally run out of space. Finally, it is also good to monitor how much memory the system has available as a whole since this was frequently directly connected to CPU loads saturating on the instances, as briefly outlined above.

Conclusion: The Big Picture of our Journey

As mentioned in the beginning, our team at Zalando has been using the Kafka Streams API since its initial release in Kafka 0.10.0 in mid-2016.  While it wasn’t a smooth journey in the very beginning, we stuck with it and, with its recent versions, we now enjoy many benefits: our productivity as developers has skyrocketed, writing new real-time applications is now quick and easy with the Streams API’s very natural programming model, and horizontal scaling has become trivial.

In the next article of this series, we will discuss how we are backing up Apache Kafka and Zookeeper to Amazon S3 as part of our disaster recovery system.

About Apache Kafka’s Streams API

If you have enjoyed this article, you might want to continue with the following resources to learn more about Apache Kafka’s Streams API:

Join our ace tech team. We’re hiring!

Zalando Tech Blog – Technical Articles Hunter Kelly

Using Apache Kafka and the Kafka Streams API with Scala on AWS for real-time fashion insights

This piece was originally published on

The Fashion Web

Zalando, Europe’s leading online fashion platform, cares deeply about fashion. Our mission statement is to “Reimagine fashion for the good of all”. To reimagine something, first you need to understand it. The Dublin Fashion Insight Centre was created to understand the “Fashion Web” – what is happening in the fashion world beyond the borders of what’s happening within Zalando’s shops.

We continually gather data from fashion-related sites. We bootstrapped this process with a list of relevant sites from our fashion experts, but as we scale our coverage and add support for multiple languages (spoken, not programming), we need to know what the next “best” sites are.

Rather than relying on human knowledge and intuition, we needed an automated, data-driven methodology to do this. We settled on a modified version of Jon Kleinberg’s HITS algorithm. HITS (Hyperlink Induced Topic Search) is also sometimes known as Hubs and Authorities, which are the main outputs of the algorithm. We use a modified version of the algorithm, where we flatten to the domain level (e.g. a whole site’s domain) rather than working at the original per-document level (e.g. an individual page URL).

HITS in a Nutshell

The core concept in HITS is that of Hubs and Authorities. Basically, a Hub is an entity that points to lots of other “good” entities. An Authority is the complement; an entity pointed to by lots of other “good” entities. The entities here, for our purposes, are web sites represented by their domains. Domains have both Hub and Authority scores, and they are separate (this turns out to be important, which we’ll explain later).

These Hub and Authority scores are computed using an adjacency matrix. For every domain, we mark the other domains that it has links to. This is a directed graph, with the direction being who links to whom.


Once you have the adjacency matrix, you perform some straightforward matrix calculations to calculate a vector of Hub scores and a vector of Authority scores as follows:

  • Sum across the columns and normalize, this becomes your Hub vector
  • Multiply the Hub vector element-wise across the adjacency matrix
  • Sum down the rows and normalize, this becomes your Authority vector
  • Multiply the Authority vector element-wise down the adjacency matrix
  • Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until  eventually you reach convergence—that is, the vectors stop changing—and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point.  We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.
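For reference, the textbook form of these updates, with adjacency matrix $A$ (where $A_{ij} = 1$ if domain $i$ links to domain $j$), hub vector $h$ and authority vector $a$, is:

$$a^{(k)} = \frac{A^{\top} h^{(k-1)}}{\bigl\lVert A^{\top} h^{(k-1)} \bigr\rVert}, \qquad h^{(k)} = \frac{A\, a^{(k)}}{\bigl\lVert A\, a^{(k)} \bigr\rVert}$$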

So why not just use the raw counts from the adjacency matrix? The beauty of the HITS algorithm is that Hubs and Authorities are mutually supporting—the better the sites that something points at, the better a Hub it is; similarly, the better the sites that point to something, the better an Authority it is. That is why the iteration is necessary: it bubbles the good stuff up to the top.

(Technically, you don’t have to iterate. There’s some fancy matrix math you can do instead with calculating the eigenvectors. In practice, we found that when working with large, sparse matrices, the results didn’t turn out the way we expected, so we stuck with the iterative, straightforward method.)

Common Questions

What about non-fashion domains? Don’t they clog things up?
Yes, in fact, on the first run of the algorithm, sites like Facebook, Twitter, Instagram, et al. were right up at the top of the list. Our Fashion Librarians then curated that list to get a nice list of fashion-relevant sites to work with.

Why not PageRank?
PageRank needs to have nearly complete information on the web, along with the resources needed to gather and process it. We only have outgoing link data on the domains that are already in our working list. We need an algorithm that is robust in the face of partial information.

This is where the power of the separate Hub and Authority scores comes in. Given the information for our seeds, they become our list of Hubs. We can then calculate the Authorities, filter out our seeds, and have a ranked list of stuff we don’t have. Voilà!  Problem solved, even in the face of partial knowledge.

But wait, you said Kafka Streams?

Kafka was already part of our solution, so it made sense to try to leverage that infrastructure and our experience using it. Here’s some of the thinking behind why we chose to go with Apache Kafka’s® Streams API to perform such real-time ranking of domains as described above:

  • It has all the primitives necessary for MapReduce-style computation: the “Map” step can be done with groupBy & groupByKey, the “Reduce” step can be done with reduce & aggregate.
  • Streaming allows us to have real-time, up-to-date data.
  • The focus stays on the data. We’re not thinking about distributed computing machinery.
  • It fits in naturally with the functional style of the rest of our application.

You may wonder at this point, “Why not use MapReduce? And why not use tools like Apache Hadoop or Apache Spark that provide implementations of MapReduce?” Given that MapReduce was originally invented to solve this type of ranking problem, why not use it for the very similar type of computation we have here? There are a few reasons we didn’t go with it:

  • We’re a small team, with no previous Hadoop experience, which rules Hadoop out.
  • While we do run Spark jobs occasionally in batch mode, it is a high infrastructure cost to run full-time if you’re not using it for anything else.
  • Initial experience with Spark Streaming, snapshotting, and recovery didn’t go smoothly.

How We Do It

For the rest of this article we are going to assume at least a basic familiarity with the Kafka Streams API and its two core abstractions, KStream and KTable. If not, there are plenty of tutorials, blog posts and examples available.


The real-time ranking is performed by a set of Scala components (groups of related functionality) that use the Kafka Streams API. We deploy them via containers in AWS, where they interact with our Kafka clusters. The Kafka Streams API allows us to run each component, or group of components, in a distributed fashion across multiple containers depending on our scalability needs.

At a conceptual level, there are three main components of the application. The first two, the Domain Link Extractor and the Domain Reducer, are deployed together in a single JVM. The third component, the HITS Calculator and its associated API front end, is deployed as a separate JVM. In the diagram below, the curved bounding boxes represent deployment units; the rectangles represent the components.

Data, in the form of S3 URLs to stored web documents, comes into the system on an input topic. The Domain Link Extractor loads the document, extracts the links, and outputs a mapping from domain to associated domains, for that document. We’ll drill into this a little bit more below. At a high level, we use the flatMap KStream operation. We use flatMap rather than map to simplify the error handling: we’re not overly concerned with errors, so for each document we either emit one new stream element on success, or zero if there is an error. Using flatMap makes that straightforward.

These mappings then go to the Domain Reducer, where we use groupByKey and reduce to create a KTable. The key is the domain, and the value in the table is the union of all the domains that the key domain has linked to, across all the documents in that domain. From the KTable, we use toStream to convert back to a KStream and from there to an output topic, which is log-compacted.
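Putting the Domain Link Extractor and Domain Reducer together, a rough, self-contained sketch of such a topology using the Kafka Streams Scala DSL is shown below. The topic names, the comma-joined String encoding of the linked-domain set, and the extractDomainLinks stub are assumptions made for illustration, not our actual implementation:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object DomainGraphTopology extends App {
 // Placeholder: the real function loads the document from S3 and extracts
 // a (source domain -> linked domains) mapping from it.
 def extractDomainLinks(s3Url: String): Option[(String, Set[String])] = None

 val builder = new StreamsBuilder

 // Domain Link Extractor: zero or one mapping per document, so failed
 // documents are simply dropped rather than handled explicitly.
 val documentUrls = builder.stream[String, String]("document-urls")
 val perDocumentLinks = documentUrls.flatMap { (_, s3Url) =>
   extractDomainLinks(s3Url)
     .map { case (domain, links) => (domain, links.mkString(",")) }
     .toList
 }

 // Domain Reducer: union of all linked domains seen for each source domain.
 val domainLinks = perDocumentLinks
   .groupByKey
   .reduce((acc, next) => (acc.split(",").toSet ++ next.split(",").toSet).mkString(","))

 // Back to a stream and out to a log-compacted topic for the HITS Calculator.
 domainLinks.toStream.to("domain-links")

 val props = new Properties()
 props.put(StreamsConfig.APPLICATION_ID_CONFIG, "domain-graph-sketch")
 props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
 new KafkaStreams(builder.build(), props).start()
}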

The final piece of the puzzle is the HITS Calculator. It reads in the updates to domains, keeps the mappings in a local cache, uses these mappings to create the adjacency matrix, and then performs the actual HITS calculation using the process described above. The ranked Hubs and Authorities are then made available via a REST API.

The flexibility of Kafka’s Streams API

Let’s dive into the Domain Link Extractor for a second, not to focus on the implementation, but as a means of exploring the flexibility that Kafka Streams gives.

The current implementation of the Domain Link Extractor component is a function that calls four more focused functions, tying everything together with a Scala for comprehension.  This all happens in a KStream flatMap call. Interestingly enough, the monadic style of the for comprehension fits in very naturally with the flatMap KStream call.
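The shape of that composition might look something like the sketch below; the function names, the Option-based error handling, and the types are illustrative assumptions rather than our actual code:

object DomainLinkExtractorSketch {
 case class ParsedDoc(links: Set[String])

 // Four hypothetical, focused functions, each of which can fail.
 def loadDocument(s3Url: String): Option[String]                   = None // fetch the raw document from S3
 def parseDocument(raw: String): Option[ParsedDoc]                 = None // parse the markup
 def extractExternalLinks(doc: ParsedDoc): Option[Set[String]]     = None // keep only external links
 def toDomainMapping(links: Set[String]): Option[(String, String)] = None // (domain, linked domains)

 // The body passed to the KStream flatMap call: one element on success,
 // zero if any step fails.
 def extract(s3Url: String): Iterable[(String, String)] =
   (for {
     raw     <- loadDocument(s3Url)
     doc     <- parseDocument(raw)
     links   <- extractExternalLinks(doc)
     mapping <- toDomainMapping(links)
   } yield mapping).toList
}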

One of the nice things about working with Kafka Streams is the flexibility that it gives us. For example, if we wanted to add information to the extracted external links, it is very straightforward to capture the intermediate output (say, by publishing the intermediate KStream to an additional topic) and further process that data, without interfering with the existing calculation.

In Summary

For us, Apache Kafka is useful for much more than just collecting and sharing data in real-time; we use it to solve important problems in our business domain by building applications on top. Specifically, Kafka’s Streams API enables us to build real-time Scala applications that are easy to implement, elastically scalable, and that fit well into our existing, AWS-based deployment setup.  

The programming style fits in very naturally with our already existing functional approach, allowing us to quickly and easily tackle problems with a natural decomposition of the problem into flexible and scalable microservices.

Given that much of what we’re doing is manipulating and transforming data, we get to stay close to the data. Kafka Streams doesn’t force us to work at too abstract a level or distract us with unnecessary infrastructure concerns.

It’s for these reasons we use Apache Kafka’s Streams API in the Dublin Fashion Insight Centre as one of our go-to tools in building up our understanding of the Fashion Web.

Interested in working in our team? Get in touch: we’re hiring.

Zalando Tech Blog – Technical Articles Conor Clifford

Zalando is using an event-driven approach for its new Fashion Platform. Conor Clifford examines why

In a recent post, I wrote about how we went about building the core Article services and applications of Zalando’s new Fashion Platform with a strong event-first focus. That new platform also has a strong overall event-driven focus, rather than a more “traditional” service-oriented approach.

The concept of “event-driven” is not a new one; indeed, it has been quite well covered in recent years.

In this post, we look at why we are using an event-driven approach to build the new Fashion Platform in Zalando.

Serving Complexity

A “traditional” service/microservice architecture is composed of many individual services, each with different responsibilities. Each service will likely have several, probably many, clients, each interacting with the service to fetch data as needed. And these clients may themselves be services for other clients, and so on.

Various clients will have different requirements for the data they are fetching, for example:

  • Regular high frequency individual primary key fetches
  • Intermittent, yet regular, large bulk fetches
  • Non-primary key based queries/searches, also with varieties of frequencies and volumes
  • All the above with differing expectations/requirements around response times, and request throughputs

Over time, as such systems grow and become more complex, the demands on each of these services grow, both in terms of new functional requirements, as well as operational demands. From generally easy beginnings, growth can lead to ever-increasing complexity of the overall system, resulting in systems that are difficult to operate, scale, maintain and improve over time.

While there are excellent tools and techniques for dealing with and managing these complexities, these target the symptoms, not the underlying root causes.

Perhaps there is a better way.

Want to know more about Zalando Dublin? Check out the video straight from our fashion insights center.

Inversion of flow

The basic underlying concept here is to invert this traditional flow of information: to change from a top-down, request-oriented system to one where data flows from the bottom up, with changes to data causing new snapshot events to be published. These changes propagate upwards through the system, being handled appropriately by a variety of client subsystems along the way.

Rather than fetching data on demand, clients requiring the data can process it appropriately for their own needs, at their own pace. That processing might mean transforming, merging and producing new events; or it might mean building an appropriate local persisted projection of the data, e.g. a high-speed key-value store for fast lookups, populating an analytical database, maintaining a search cluster, or even maintaining a corpus of data for various data science/machine learning activities. In fact, there can and will be clients that do a combination of such activities around the event data.
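As a rough illustration of the last point, the sketch below shows a minimal consumer that maintains a local key-value projection from a stream of snapshot events. The topic name, the String serdes, and the in-memory map are assumptions for the example; a real client would typically use a persistent store:

import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.mutable
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ArticleProjection extends App {
 val props = new Properties()
 props.put("bootstrap.servers", "localhost:9092")
 props.put("group.id", "article-projection")
 props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
 props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

 val consumer = new KafkaConsumer[String, String](props)
 consumer.subscribe(Collections.singletonList("article-events"))

 // Local projection: latest snapshot per key, ready for fast lookups.
 val projection = mutable.Map.empty[String, String]

 while (true) {
   val records = consumer.poll(Duration.ofMillis(500))
   for (record <- records.asScala) {
     // Each event is a full snapshot of the entity, so the latest value wins.
     projection.put(record.key, record.value)
   }
 }
}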

On-Demand Requests Are Easy

Building a client that pulls data on demand from a service would appear the easier thing to do, with clients being free to just fetch data directly as needed. From the client perspective, it can even appear that there is no obvious benefit to an event-driven approach.

However, with a view to the wider platform ecology (many clients, many services, lots of data, etc.), the traditional “pull-based” approach will lead to much more complex and problematic systems, leading to a variety of challenges:

  • Operation and Maintenance - core services in pull-based systems grow to serve more and more clients over time; clients with different access requirements (PK fetches, batch fetches, periodic "fetch the world" cycles, etc.). As the number and types of such clients grow, operating and maintaining such core services becomes ever more complex and difficult.
  • Software Delivery - as clients of core services grow, so too will the list of requirements around different access patterns and capabilities of the underlying data services (e.g. inclusion of batch fetches, alternative indexing, growing request loads, competing prioritizations, etc.). This workload has a strong tendency to ultimately swamp the delivery teams of core services, to the detriment of delivering new business value. In addition to the service's team, the client teams themselves would also be dependent on new/changed functionality in the services to allow them to move forward.
  • Runtime Complexity - outages and other such incidents in "pull"-based environments can have dramatic impacts. A core service outage would essentially break any client fetching data "on demand", and multiple dependent applications can be brought down by an outage in a single underlying service. There can also be interesting dynamics on recovery of such services, with potential thundering herds, etc., causing repeated outages during recovery, prolonging, or worse, further degrading, the impact of the original incident. Even without outages, the complexity of systems built around a request/response approach makes forecasting and predicting load growth difficult: modelling the interplay of many different clients, each with different request patterns, is hard, and attempting to forecast growth for each of them becomes a real challenge.

By evolving to an event-driven system, there are many advantages over these, and other aspects:

  • Operationally - since clients receive information about change, they can react instantly and appropriately themselves. As the throughput of data is what drives the system, the performance/load characteristics are much more predictable (and testable). There is no non-determinism caused by the interplay of multiple clients, etc. In general, handing data over event streams allows for much looser coupling of clients and services, which results in simpler systems.
  • Delivery - with the ability to access complete data from the event streams, clients are no longer blocked by the service team's delivery backlog; they are completely free to move forward themselves. Similarly, service delivery team backlogs are not overloaded by requests for serving modifications/alterations, etc., and are thereby freed up to directly deliver new business value.
  • Outages - With clients receiving data changes, and handling these locally, an outage of the originating service essentially means clients working with some stale data (the data that would have been changed during that outage),  typically a much less invasive and problematic issue. In many cases, where clients depend on data that changes infrequently, if at all, once established, it’s not an issue.
  • Greater Runtime Simplicity - with data passing through event streams, and clients consuming these streams as they need, the overall dynamic of the system becomes  more predictable/less complicated.


“Time is an illusion. Lunchtime doubly so.” - Douglas Adams

There's no such thing as a free lunch. There’s likely more work up front in establishing such an architectural shift, as well as other concerns:

  • Client Burden - There will be an additional burden on clients in a purely event-driven system, with those clients having to implement and operate local persisted projections of the event stream(s) for their purposes. However, a non-trivial part of this extra work is offset by removing work (development and operational) around integrating with a synchronous API and all the details that entails; dealing with authentication, rate limiting, circuit breakers, outage mitigations, etc. There is also less work involved with not having an API that is 100% purpose built. In addition, the logic to maintain such local snapshot projections is straightforward (e.g. write an optionally transformed value to a “key value” store for fast local lookups).
  • Source of Truth - A common concern with having locally managed projections of the source data is that there is a loss of the single source of truth that the originating service represents in the traditional model. By following an “Event First” approach, with the event stream itself being the primary source of truth, and by allowing only changes from the event stream itself to cause changes to any of the local projections, the source of truth is kept true.
  • Traditional Clients - there may be clients that are not in a position to deal with a local data projection (e.g. clients that only require a few requests processed, clients that facilitate certain types of custom/external integrations, etc.). In such cases, there may be a need to provide a more traditional “request-response” interface. These, though, can be built using the same foundations, i.e. a custom data projection and a dedicated new service using it to address these clients’ needs. We need to ensure that any clients looking to fit the “traditional” model are appropriate candidates to do so; care should be taken to resist the temptation to implement the “easy” solution rather than the correct one.


In the modern era of building growing systems, with many hundreds of engineers and dozens of teams all trying to move fast and deliver excellent software with key business value, there is a need for less fragile solutions.

In this post, we have looked at moving away from a more regular “service”-oriented architecture towards one driven by event streams. There are many advantages, but they come with their own set of challenges. And of course, there is no such thing as a silver bullet.

In the next post, I will look at some of the lessons we have learned building the system, and some practices that should be encouraged.

If you are interested in working with these types of systems and challenges, join us. We’re hiring!

Zalando Tech Blog – Technical Articles Vadym Kukhtin

Two brothers examine the pros and cons of UI testing

Based on their different experiences in Partner-Solutions and Zalando Media Solutions respectively, we speak to front-end developers, Vadym Kukhtin and Aleks Kukhtin about their opposing opinions on UI testing.

The Case Against UI Testing - Vadym

TL;DR It depends on preference, but I believe that UI testing isn’t required in every instance

In my experience, it is a Sisyphean task to force developers to write even basic unit tests, never mind UI and E2E tests. Only Spartans led by Leonidas can achieve UI and E2E testing.

Of course, the case for UI testing is more complex than a simple “good vs bad” dichotomy. For example, the scale and scope of the app should be taken into account. If the app is small or short-term, most probably UI tests aren’t required. If it’s a monster project that needs to be covered as much as possible, then unit and E2E tests are required.

In a real-world app, any interaction should change some state of the app, whether it’s a click, a hover or any custom event. With unit tests, the developer can test internal component or service functionality; with E2E tests, the developer can test common component interactions, connections to third-party services, and API and backend functionality.


  • Use case: User should be able to login using OAuth and see “Hello” board.
  • App: LoginComponent:
  • Test case:

This process can be incredibly time consuming, with developers spending time writing the tests and mocking all dependencies. For small or short-term apps, we have to ask ourselves: is the time worth it?

My answer would be no.

The Case for UI Testing - Aleks

TL;DR In most cases I think we do need to write UI tests.

Let’s start with a small illustration. You start a new project and a month later have a nicely working app. You then decide to change one component in your structure. It is only a UI component, so you know that the logic hasn’t changed. The change itself works, but now the app has some errors: it looks like you forgot about some style changes, and your component now looks awkward. So, you fix the styles, deploy the changes, sit back and relax. But now you have unwanted style issues in another component. So you do the same thing again and deploy the changes. Ideally, UI tests can identify this kind of problem.

Like most things in life, UI tests have advantages and disadvantages. Some argue they take too much time, but this time can be considered an investment, safeguarding against any unwanted games of “tennis” as seen in the example above. Testing the UI helps us better understand our code and what it should actually render.

Yes, testing is complicated. Complex UI logic is pretty hard to test, but not impossible. The problem is that the process involves a lot of trouble for the advantages gained: you need to write a lot of tests, yet even small changes (which happen often with UI) force you to rewrite a large part of them.

In Conclusion
The biggest takeaway from our discussion is that UI testing cannot be simply filed away under “good” or “bad”. In some circumstances – such as small apps or short-term projects – testing may not be the best use of time. In others, testing is a must for maintaining the integrity of the components and saving time in the future.

Got an opinion on UI testing and want to bring it to a dynamic team? Get in touch. We’re hiring.

Zalando Tech Blog – Technical Articles Andrey Dyachkov

At Zalando we’ve created Nakadi, a distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues. It helps to provide an available, durable, and fault tolerant publish/subscribe messaging system for simple microservices communication.

A Kafka cluster can grow to hold a huge amount of data on its disks. Hosting Kafka requires support for instance termination (on purpose, or just because the “cloud provider” decided to terminate the instance), which in our case introduces a node with no data: the whole cluster then has to be rebalanced in order to evenly distribute the data among the nodes, which takes hours of data copying. Here, we are going to talk about how we avoided rebalancing after node termination in a Kafka cluster hosted on AWS.

In the beginning, at least when I joined, our Kafka cluster had the following configuration:

  • 5 Kafka brokers: m3.medium 2TB SSD
  • Replication factor 3 and min insync replicas 2
  • 3 Zookeeper nodes: m3.medium
  • Ingest 250GB per day

Nowadays, the cluster is much bigger:

  • 15 Kafka brokers: m4.2xlarge 8TB HDD
  • Replication factor 3 and min insync replicas 2
  • 3 Zookeeper nodes: i3.large
  • Ingest 5TB per day and egress 30TB per day


The above setup results in a number of problems that we are looking to solve, such as:

Loss of instance

AWS is able to terminate or restart your instance without notifying you in advance of the fact. Once it has happened, the broker has lost its data, which means it has to copy the log from the other brokers. This is a pain point, because it takes hours to accomplish.

Changing instance type

The load is growing, and at some point the decision is made to upgrade the AWS instance type to sustain the load. This could be a major issue in terms of time as well as availability. It correlates with the first issue, but is a different scenario for losing broker data.

Upgrading a Docker image

Zalando has to follow certain compliance guidelines, which is supported by using services like Senza and Taupage. These in turn have requirements of their own, such as immutable Docker images that cannot be replaced while the instance is running. To overcome this, one has to relaunch the instance, hence copying a lot of data from other Kafka brokers.

Kafka cluster upgrade

Upgrading your Kafka version (or maybe downgrading it) requires building a different image which holds new parameters for downloading that Kafka version. This again requires the termination of the instance, involving data copying.

When the cluster is quite small, it is pretty fast to rebalance it; this takes around 10-20 minutes. However, the bigger the cluster, the longer it takes to rebalance: a rebalance of our current cluster has taken about 7 hours when one broker is down.


Our Kafka brokers were already using attached EBS volumes, which is an additional volume, located somewhere in the AWS Data Center. This is connected to the instance via network in order to have durability, availability and more disk space available. The AWS documentation states: "Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud."

The only tiny issue here is that instance termination would bring the EBS volume down together with the instance, introducing data loss for one of the brokers. The figure below shows how it was organized:

The solution we found was to detach the EBS volume from the machine before terminating the instance, and attach it to the new running instance. There is one small detail here: it is better to shut Kafka down gracefully in order to decrease the startup time. If you terminate Kafka in a “dirty” way without stopping it, it will rebuild the log index from the start, which can take a lot of time depending on how much data is stored on the broker.

In the diagram above we see that the instance termination does not touch any EBS volume, because it was safely detached from the instance.

Losing a broker without detaching its EBS volume (terminating it together with the instance) introduces under-replicated partitions on the other brokers that hold the same partitions. Recovering from that state via a rebalance takes around 6-7 hours. If other brokers holding the same partitions go down during the rebalance, some partitions will go offline and it will no longer be possible to publish to them. It is better not to lose any other broker.

Reattachment of EBS volumes can be accomplished using the AWS Console by clicking buttons there, but to be honest I have never done it that way myself. Our team went about automating it with Python scripts and the Boto 3 library from AWS.

The scripts are able to do the following:

  • Create a broker with attached EBS volume
  • Create a broker and attach an existing EBS volume  
  • Terminate a broker, detaching the EBS volume beforehand
  • Upgrade a Docker image reusing the same EBS volume

Kafka instance control scripts can be found in our GitHub account, where the usage is described. Basically, these are one line commands which consume configuration for the cluster in order to not pass in the script parameters. Remember, we use Senza and Taupage, so the scripts are a bit Zalando specific, but can be changed quite quickly with very little effort.
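To illustrate the core detach/terminate/attach idea (the linked scripts themselves, in Python and Boto 3, are the real automation), a minimal sketch written in Scala against the AWS SDK for Java might look like the following; the instance IDs, volume ID, and device name are placeholders:

import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{AttachVolumeRequest, DetachVolumeRequest, TerminateInstancesRequest}

object EbsReattachSketch extends App {
 val ec2 = AmazonEC2ClientBuilder.defaultClient()

 val oldInstanceId = "i-0123456789abcdef0"
 val newInstanceId = "i-0fedcba9876543210"
 val volumeId      = "vol-0123456789abcdef0"

 // 1. Stop Kafka gracefully on the old broker first (not shown), so the log
 //    index does not have to be rebuilt on startup.

 // 2. Detach the data volume before terminating the old instance.
 //    (A real script would wait for the detachment to complete.)
 ec2.detachVolume(new DetachVolumeRequest().withVolumeId(volumeId))

 // 3. Terminate the old broker instance.
 ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(oldInstanceId))

 // 4. Attach the same volume to the freshly launched replacement broker.
 ec2.attachVolume(new AttachVolumeRequest()
   .withVolumeId(volumeId)
   .withInstanceId(newInstanceId)
   .withDevice("/dev/xvdf"))
}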

However, it’s important to note that the instance could have a kernel panic or some hardware issue while running the broker. AWS Auto Recovery helps to address this kind of issue. In simple terms, it is a feature of EC2 instances that allows them to recover after a network connectivity, hardware or software failure. What does recovery mean here? The instance is rebooted after the failure, preserving many parameters of the impaired instance, among them the EBS volume attachments. This is exactly what we need!

In order to turn it on, just create a CloudWatch alarm for StatusCheckFailed_System and you are all set. The next time the instance hits such a failure scenario it will be rebooted, preserving the attached EBS volume, which helps to avoid data copying.
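As a hedged sketch of what creating such an alarm programmatically could look like (above we only rely on the console step), here is one possible definition, again in Scala against the AWS SDK for Java; the instance ID, the region in the recover-action ARN, and the thresholds are illustrative:

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{ComparisonOperator, Dimension, PutMetricAlarmRequest, Statistic}

object AutoRecoveryAlarmSketch extends App {
 val cloudWatch = AmazonCloudWatchClientBuilder.defaultClient()
 val instanceId = "i-0123456789abcdef0"

 // Alarm on the system status check; its action is the built-in EC2 recover action.
 cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
   .withAlarmName(s"auto-recover-$instanceId")
   .withNamespace("AWS/EC2")
   .withMetricName("StatusCheckFailed_System")
   .withDimensions(new Dimension().withName("InstanceId").withValue(instanceId))
   .withStatistic(Statistic.Minimum)
   .withPeriod(60)
   .withEvaluationPeriods(2)
   .withThreshold(1.0)
   .withComparisonOperator(ComparisonOperator.GreaterThanOrEqualToThreshold)
   .withAlarmActions("arn:aws:automate:eu-central-1:ec2:recover"))
}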


Our team no longer worries about losing a Kafka broker, as it can be recovered in a matter of minutes without copying data and wasting money on traffic. It now takes only 2 hours to upgrade the 15 nodes of a Kafka cluster, which happens to be 42x faster than our previous approach.

In the future, we plan to add this functionality directly to our Kafka supervisor, which will allow us to completely automate our Kafka cluster upgrades and failure scenarios.

Have any feedback or questions? Find me on Twitter at @a_dyachkov.