Zalando Tech Blog – Technical Articles Ian Duffy

A closer look at the ingredients needed for ultimate stability

This is part of a series of posts on Kafka. See Ranking Websites in Real-time with Apache Kafka’s Streams API for the first post in the series.

Remora is a small application for monitoring Kafka. With many teams deploying it to their production environments, open sourcing the application made sense. Here, I’ll go through what streaming is, why it is useful, and the motivation behind this monitoring application.

Streaming has become an important architectural pattern at Zalando. To have fast, highly available and scalable systems that process data across numerous teams and layers, good streaming infrastructure and monitoring are key.

Without streaming, a system is not reactive to changes. An older style would manage changes through incremental batches. Batch jobs, such as CRONs, can be hard to manage: you have to keep track of a large number of jobs, each taking care of a shard of data, and this can also be hard to scale. Having a streaming component, such as Kafka, provides a centralised, scalable resource that makes your architecture reactive.

Some teams use cloud infrastructure such as AWS Kinesis or SQS. Zalando chose Kafka to reach better throughput and to use frameworks like AKKA Streams.

Monitoring lag is important; without it we don’t know where the consumer is relative to the size of the queue. An analogy might be piloting a plane without knowing how many more miles are left on your journey. Zalando has trialled:

We needed a simple, independent application that could scrape metrics from a URL and place them into our metrics system. But what to include in this application? I put on my apron and played chef with the range of ingredients available to us.

Scala is big at Zalando; a majority of the developers know AKKA or Play. At the time, our team was designing systems using AKKA HTTP with an actor pattern. It is a very light, stable and asynchronous framework. Could you just wrap the Kafka command line tools in a light Scala framework? Potentially, it could take less time and be more stable. Sounds reasonable, we thought, let’s do that.
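
To sketch the idea (this is not Remora’s actual code): a bare-bones AKKA HTTP endpoint in Scala could expose whatever the wrapped tooling reports for a consumer group. The route, port and the describeConsumerGroup helper below are hypothetical placeholders.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object LagEndpointSketch extends App {
  implicit val system: ActorSystem = ActorSystem("lag-endpoint-sketch")

  // hypothetical stand-in for the logic wrapping the consumer group "describe" tooling
  def describeConsumerGroup(group: String): String =
    s"""{"group":"$group","partitions":[]}"""

  // GET /consumers/<group> returns whatever the wrapped tooling reports for that group
  val route =
    path("consumers" / Segment) { group =>
      get {
        complete(describeConsumerGroup(group))
      }
    }

  Http().newServerAt("0.0.0.0", 9000).bind(route) // AKKA HTTP 10.2+ server API
}
```

A metrics system can then poll such an endpoint on an interval and push the numbers into its own store.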

The ingredients for ultimate stability were as follows: a handful of Kafka java command line tools with a pinch of AKKA Http and a hint of an actor design. Leave to code for a few days and take out of the oven. Lightly garnish with a performance test to ensure stability, throw in some docker. Deploy, monitor, alert, and top it off with a beautiful graph to impress. Present Remora to your friends so that everyone may have a piece of the cake, no matter where in the world you are!

Bon app-etit!

Zalando Tech Blog – Technical Articles Ian Duffy

Second in our series about the use of Apache Kafka’s Streams API by Zalando

This is the second in a series about the use of Apache Kafka’s Streams API by Zalando, Europe’s leading online fashion platform. See Ranking Websites in Real-time with Apache Kafka’s Streams API for the first post in the series.

This piece was first published on confluent.io

Running Kafka Streams applications in AWS

At Zalando, Europe’s leading online fashion platform, we use Apache Kafka for a wide variety of use cases. In this blog post, we share our experiences and lessons learned to run our real-time applications built with Kafka’s Streams API in production on Amazon Web Services (AWS). Our team at Zalando was an early adopter of the Kafka Streams API. We have been using it since its initial release in Kafka 0.10.0 in mid-2016, so we hope you find this hands-on information helpful for running your own use cases in production.

What is Apache Kafka’s Streams API?

The Kafka Streams API is available as a Java library included in Apache Kafka that allows you to build real-time applications and microservices that process data from Kafka. It allows you to perform stateless operations such as filtering (where messages are processed independently from each other), as well as stateful operations like aggregations, joins, windowing, and more. Applications built with the Streams API are elastically scalable, distributed, and fault-tolerant. For example, the Streams API guarantees fault-tolerant data processing with exactly-once semantics, and it processes data based on event-time, i.e., when the data was actually generated in the real world (rather than when it happens to be processed). This conveniently covers many of the production needs for mission-critical real-time applications.
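
As a small illustration of the programming model, the sketch below (Scala, using the kafka-streams-scala DSL of recent Kafka releases; topic names and configuration values are made up for the example) chains a stateless filter with a stateful count:

```scala
import java.util.Properties

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object PageViewCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counts")   // also used as the consumer group id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  builder
    .stream[String, String]("page-views")             // assumed input topic: key = user, value = url
    .filter((_, url) => url != null && url.nonEmpty)  // stateless: each record handled independently
    .groupByKey                                       // stateful from here: count views per key
    .count()
    .toStream
    .to("page-view-counts")                           // assumed output topic

  new KafkaStreams(builder.build(), props).start()
}
```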

An example of how we use Kafka Streams at Zalando is the aforementioned use case of ranking websites in real-time to understand fashion trends.

Library Upgrades of Kafka Streams

Largely due to our early adoption of Kafka Streams, we encountered many teething problems in running Streams applications in production. However, we stuck with it due to how easy it was to write Kafka Streams code. In our early days of adoption, we hit various issues around stream consumer group rebalancing, getting locks on the local RocksDB after a rebalance, and more. These eventually got sorted out in the 0.10.2.1 release (April 2017) of the Kafka Streams API.

I/O

After upgrading to 0.10.2.1, our Kafka Streams applications were mostly stable, but we would still see what appeared to be random crashes every so often. These crashes occurred more frequently on components doing complex stream aggregations. We eventually discovered that the actual culprit was AWS rather than Kafka Streams: on AWS, General Purpose SSD (GP2) EBS volumes operate using I/O credits. The AWS pricing model allocates a baseline read and write IOPS allowance to a volume, based on the volume size. Each volume also has an IOPS burst balance, which acts as a buffer if the base limit is exceeded. Burst balance replenishes over time, but as it gets used up, reads and writes are throttled down to the baseline, leaving the application with an EBS volume that is very unresponsive.

This ended up being the root cause of most of our issues with aggregations. When running Kafka Streams applications, we had initially assigned 10 GB disks as we didn’t foresee much storage occurring on these boxes. However, under the hood, the applications performed lots of read/write operations on their RocksDB stores, which used up the I/O credits, and given the size of our disks, the credits were not replenished quickly enough, grinding our applications to a halt. We remediated this issue by provisioning Kafka Streams applications with much larger disks. This gave us more baseline IOPS, and the burst balance was replenished at a faster rate.
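
For a back-of-the-envelope feel for why the disk size mattered: GP2 volumes earn a baseline of roughly 3 IOPS per provisioned GiB (with a floor of 100 IOPS) and can burst to about 3,000 IOPS while credits last. A tiny sketch of that arithmetic:

```scala
// GP2 baseline: roughly 3 IOPS per provisioned GiB, with a floor of 100 IOPS
def gp2BaselineIops(sizeGiB: Int): Int = math.max(100, 3 * sizeGiB)

gp2BaselineIops(10)    // 100  -> once burst credits are spent, a 10 GiB volume is throttled hard
gp2BaselineIops(1000)  // 3000 -> a bigger volume sustains the rate it can otherwise only burst to
```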

Monitoring

EBS Burst Balance

Our monitoring solution polls CloudWatch metrics and pulls back all AWS exposed metrics. For the issue outlined, the most important of these is EBS burst balance. As mentioned above, in many cases applications that use Kafka Streams rely on heavy utilization of locally persisted RocksDBs for storage and quick data access. This storage is persisted on the instance’s EBS volumes and generates a high read and write workload on the volumes. GP2 disks were used in preference to provisioned IOPS disks (IO1) since these were found to be much more cost-effective in our case.

Fine-tuning your application

With upgrades to the underlying Kafka Streams library, the Kafka community introduced many improvements to the default stream configuration. In previous, more unstable iterations of the client library, we spent a lot of time tweaking config values such as session.timeout.ms, max.poll.interval.ms, and request.timeout.ms to achieve some level of stability.

With new releases we found ourselves discarding these custom values and achieving better results. However, some timeout issues persisted on some of our services, where a service would frequently get stuck in a rebalancing state. We noticed that reducing the max.poll.records value for the stream configs would sometimes alleviate issues experienced by these services. From partition lag profiles we also saw that the consuming issue seemed to be confined to only a few partitions, while the others would continue processing normally between re-balances. Ultimately we realised that the processing time for a record in these services could be very long (up to minutes) in some edge cases. Kafka has a fairly large maximum offset commit time before a stream consumer is considered dead (five minutes), but with larger message batches of data, this timeout was still being exceeded. By the time the processing of the record was finished, the stream was already marked as failed and so the offset could not be committed. On rebalance, this same record would once again be fetched from Kafka, would fail to process in a timely manner and the situation would repeat. Therefore for any of the affected applications, we introduced a processing timeout, ensuring there was an upper bound on the time taken by any of our edge cases.
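
For reference, overrides like these are passed through the Streams configuration and forwarded to the embedded consumers. A hedged sketch (the numbers are illustrative, not our production values):

```scala
import java.util.Properties

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

object TunedStreamsConfig {
  def apply(): Properties = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")     // illustrative values
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    // fewer records per poll() gives slow edge cases a better chance of
    // finishing before the consumer is considered dead
    props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), "100")

    // values we used to hand-tune on older client versions; recent defaults are usually fine
    // props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "30000")
    // props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), "300000")
    props
  }
}
```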

Monitoring

Consumer Lag

By looking at the metadata of a Kafka consumer group, we can determine a few key metrics. How many messages are being written to the partitions the group consumes? How many messages are being read from those partitions? The difference between these is called lag; it represents how far the consumers lag behind the producers.
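
Expressed in code, lag is just a per-partition subtraction; a minimal sketch (in practice the two offset maps come from Kafka’s admin and consumer APIs):

```scala
import org.apache.kafka.common.TopicPartition

object ConsumerLag {
  // lag per partition = latest offset written (log end offset) minus latest offset committed by the group
  def partitionLag(endOffsets: Map[TopicPartition, Long],
                   committed: Map[TopicPartition, Long]): Map[TopicPartition, Long] =
    endOffsets.map { case (tp, end) => tp -> math.max(0L, end - committed.getOrElse(tp, 0L)) }

  def totalLag(endOffsets: Map[TopicPartition, Long],
               committed: Map[TopicPartition, Long]): Long =
    partitionLag(endOffsets, committed).values.sum
}
```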

The ideal running state is that lag is a near zero value. At Zalando, we wanted a way to monitor and plot this to see if our streams applications are functioning.

After trying out a number of consumer lag monitoring utilities, such as Burrow, Kafka Lag Monitor and Kafka Manager, we ultimately found these tools either too unstable or a poor fit for our use case. From this need, our co-worker, Mark Kelly, built a small utility called Remora. It is a simple HTTP wrapper around the Kafka consumer group “describe” command. By polling the Remora HTTP endpoints from our monitoring system at a set time interval, we were able to get good insights into our stream applications.

Memory

Our final issue was due to memory consumption. Initially we somewhat naively assigned very large heaps to the Java virtual machine (JVM). This was a bad idea because Kafka Streams applications utilize a lot of off-heap memory when configured to use RocksDB as their local storage engine, which is the default. By assigning large heaps, there wasn’t much free system memory. As a result, applications would eventually come to a halt and crash. For our applications we use m4.large instances; we assign 4 GB of RAM to the heap and usually utilize about 2 GB of it, leaving 4 GB of RAM free for off-heap and system usage, with overall system memory utilization at around 70%. Additionally, we would recommend reviewing the memory management section of Confluent’s Kafka Streams documentation, as customising the RocksDB configuration was necessary in some of our use cases.
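
As a hedged example of the kind of customisation meant here, Kafka Streams lets you plug in a RocksDBConfigSetter to bound the off-heap memory each store uses; the numbers below are illustrative, not our production settings:

```scala
import java.util

import org.apache.kafka.streams.state.RocksDBConfigSetter
import org.rocksdb.{BlockBasedTableConfig, Options}

// bounds the off-heap memory each RocksDB store may use (illustrative numbers)
class BoundedRocksDBConfig extends RocksDBConfigSetter {
  override def setConfig(storeName: String, options: Options, configs: util.Map[String, AnyRef]): Unit = {
    val tableConfig = new BlockBasedTableConfig()
    tableConfig.setBlockCacheSize(16 * 1024 * 1024L) // 16 MB block cache per store
    options.setTableFormatConfig(tableConfig)
    options.setMaxWriteBufferNumber(2)               // fewer in-memory write buffers per store
  }
}

// registered via the streams config:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, classOf[BoundedRocksDBConfig])
```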

Monitoring

JVM Heap Utilization

We expose JVM heap utilization using Dropwizard metrics via HTTP. This is polled on an interval by our monitoring solution and graphed. Many of our applications are fairly memory intensive, with in-memory caching, so it was important for us to be able to see at a glance how much memory was available to the application. Additionally, due to the relative complexity of many of our applications, we wanted to have easy visibility into garbage collection in the systems. Dropwizard metrics offered a robust, ready-made solution for these problems.
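
A minimal sketch of that wiring with the metrics-core and metrics-jvm modules (how the registry is served over HTTP is left out here, as that lives in each service’s HTTP layer):

```scala
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.jvm.{GarbageCollectorMetricSet, MemoryUsageGaugeSet}

object JvmMetrics {
  val registry = new MetricRegistry()
  registry.register("jvm.memory", new MemoryUsageGaugeSet())   // heap and non-heap usage gauges
  registry.register("jvm.gc", new GarbageCollectorMetricSet()) // collection counts and times
  // the registry is then exposed over HTTP and polled by the monitoring system
}
```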

CPU, System Memory Utilization and Disk Usage

We run the Prometheus node exporter on all of our servers; it exports lots of system metrics via HTTP. Again, our monitoring solution polls this on an interval and graphs them. While the JVM monitoring provided great insight into what was going on in the JVM, we also needed insight into what was going on in the instance. In general, most of the applications we ended up writing had much greater network and memory overheads than CPU requirements. However, in many failure cases we saw, our instances were ultimately terminated by auto-scaling groups on failing their health checks. These health checks would fail because the endpoints became unresponsive due to high CPU loads as other resources were used up. While this was usually not due to high CPU use in the application processing itself, it was a great symptom to capture and dig further into where this CPU usage was coming from. Disk monitoring also proved very valuable, particularly for Kafka Streams applications consuming from topics with a large partitioning factor and/or doing more complex aggregations. These applications store a fairly large amount of data (200MB per partition) in RocksDB on the host instance, so it is very easy to accidentally run out of space. Finally, it is also good to monitor how much memory the system has available as a whole, since this was frequently directly connected to CPU loads saturating on the instances, as briefly outlined above.

Conclusion: The Big Picture of our Journey

As mentioned in the beginning, our team at Zalando has been using the Kafka Streams API since its initial release in Kafka 0.10.0 in mid-2016.  While it wasn’t a smooth journey in the very beginning, we stuck with it and, with its recent versions, we now enjoy many benefits: our productivity as developers has skyrocketed, writing new real-time applications is now quick and easy with the Streams API’s very natural programming model, and horizontal scaling has become trivial.

In the next article of this series, we will discuss how we are backing up Apache Kafka and Zookeeper to Amazon S3 as part of our disaster recovery system.

About Apache Kafka’s Streams API

If you have enjoyed this article, you might want to continue with the following resources to learn more about Apache Kafka’s Streams API:

Join our ace tech team. We’re hiring!

Zalando Tech Blog – Technical Articles Hunter Kelly

Using Apache Kafka and the Kafka Streams API with Scala on AWS for real-time fashion insights

This piece was originally published on confluent.io

The Fashion Web

Zalando, Europe’s leading online fashion platform, cares deeply about fashion. Our mission statement is to “Reimagine fashion for the good of all”. To reimagine something, first you need to understand it. The Dublin Fashion Insight Centre was created to understand the “Fashion Web” – what is happening in the fashion world beyond the borders of what’s happening within Zalando’s shops.

We continually gather data from fashion-related sites. We bootstrapped this process with a list of relevant sites from our fashion experts, but as we scale our coverage and add support for multiple languages (spoken, not programming), we need to know what the next “best” sites are.

Rather than relying on human knowledge and intuition, we needed an automated, data-driven methodology to do this. We settled on a modified version of Jon Kleinberg’s HITS algorithm. HITS (Hyperlink Induced Topic Search) is also sometimes known as Hubs and Authorities, which are the main outputs of the algorithm. We use a modified version of the algorithm, where we flatten to the domain level (e.g., Vogue.com) rather than working at the original per-document level (e.g., http://Vogue.com/news).

HITS in a Nutshell

The core concept in HITS is that of Hubs and Authorities. Basically, a Hub is an entity that points to lots of other “good” entities. An Authority is the complement; an entity pointed to by lots of other “good” entities. The entities here, for our purposes, are web sites represented by their domains such as Vogue.com or ELLE.com. Domains have both Hub and Authority scores, and they are separate (this turns out to be important, which we’ll explain later).

These Hub and Authority scores are computed using an adjacency matrix. For every domain, we mark the other domains that it has links to. This is a directed graph, with the direction being who links to whom.

(Image courtesy of http://faculty.ycp.edu/~dbabcock/PastCourses/cs360/lectures/lecture15.html)

Once you have the adjacency matrix, you perform some straightforward matrix calculations to calculate a vector of Hub scores and a vector of Authority scores as follows:

  • Sum across the columns and normalize; this becomes your Hub vector
  • Multiply the Hub vector element-wise across the adjacency matrix
  • Sum down the rows and normalize; this becomes your Authority vector
  • Multiply the Authority vector element-wise down the adjacency matrix
  • Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until  eventually you reach convergence—that is, the vectors stop changing—and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point.  We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.

So why not just use the raw counts from the adjacency matrix? The beauty of the HITS algorithm is that Hubs and Authorities are mutually supporting—the better the sites that something points at, the better a Hub it is; similarly, the better the sites that point to something, the better an Authority it is. That is why the iteration is necessary: it bubbles the good stuff up to the top.

(Technically, you don’t have to iterate. There’s some fancy matrix math you can do instead, calculating the eigenvectors. In practice, we found that when working with large, sparse matrices, the results didn’t turn out the way we expected, so we stuck with the iterative, straightforward method.)
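
To make the iteration concrete, here is a toy, self-contained Scala sketch of the textbook formulation (an Authority’s score is the sum of the Hub scores of the domains linking to it; a Hub’s score is the sum of the Authority scores of the domains it links to), run over a made-up adjacency map rather than our real data:

```scala
object HitsSketch extends App {
  // toy adjacency data: links(from) = the set of domains that `from` links to (hypothetical)
  val links: Map[String, Set[String]] = Map(
    "styleblog.example"   -> Set("vogue.com", "elle.com"),
    "fashionnews.example" -> Set("vogue.com"),
    "vogue.com"           -> Set("elle.com")
  )
  val domains = (links.keySet ++ links.values.flatten).toVector

  def normalize(v: Map[String, Double]): Map[String, Double] = {
    val norm = math.sqrt(v.values.map(x => x * x).sum)
    if (norm == 0) v else v.map { case (k, x) => k -> x / norm }
  }

  var hubs  = domains.map(_ -> 1.0).toMap
  var auths = domains.map(_ -> 1.0).toMap

  for (_ <- 1 to 20) { // fixed number of iterations; the top entries stabilise quickly
    auths = normalize(domains.map { d =>
      d -> domains.filter(s => links.getOrElse(s, Set.empty[String]).contains(d)).map(hubs).sum
    }.toMap)
    hubs = normalize(domains.map { d =>
      d -> links.getOrElse(d, Set.empty[String]).toSeq.map(auths).sum
    }.toMap)
  }

  println(auths.toSeq.sortBy(-_._2).take(5)) // ranked Authorities, best first
}
```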

Common Questions


What about non-fashion domains? Don’t they clog things up?
Yes, in fact, on the first run of the algorithm, sites like Facebook, Twitter, Instagram, et al. were right up at the top of the list. Our Fashion Librarians then curated that list to get a nice list of fashion-relevant sites to work with.

Why not PageRank?
PageRank needs nearly complete information on the web, along with the resources required to gather and process it. We only have outgoing link data on the domains that are already in our working list. We need an algorithm that is robust in the face of partial information.

This is where the power of the separate Hub and Authority scores comes in. Given the information for our seeds, they become our list of Hubs. We can then calculate the Authorities, filter out our seeds, and have a ranked list of stuff we don’t have. Voilà!  Problem solved, even in the face of partial knowledge.

But wait, you said Kafka Streams?

Kafka was already part of our solution, so it made sense to try to leverage that infrastructure and our experience using it. Here’s some of the thinking behind why we chose to go with Apache Kafka®’s Streams API to perform such real-time ranking of domains as described above:

  • It has all the primitives necessary for MapReduce-style computation: the “Map” step can be done with groupBy & groupByKey, the “Reduce” step can be done with reduce & aggregate.
  • Streaming allows us to have real-time, up-to-date data.
  • The focus stays on the data. We’re not thinking about distributed computing machinery.
  • It fits in naturally with the functional style of the rest of our application.

You may wonder at this point, “Why not use MapReduce? And why not use tools like Apache Hadoop or Apache Spark that provide implementations of MapReduce?” Given that MapReduce was invented originally to solve this type of ranking problem, why not use it for the very similar type of computation we have here? There are a few reasons we didn’t go with it:

  • We’re a small team, with no previous Hadoop experience, which rules Hadoop out.
  • While we do run Spark jobs occasionally in batch mode, it is a high infrastructure cost to run full-time if you’re not using it for anything else.
  • Initial experience with Spark Streaming, snapshotting, and recovery didn’t go smoothly.

How We Do It

For the rest of this article we are going to assume at least a basic familiarity with the Kafka Streams API and its two core abstractions, KStream and KTable. If not, there are plenty of tutorials, blog posts and examples available.

Overview

The real-time ranking is performed by a set of Scala components (groups of related functionality) that use the Kafka Streams API. We deploy them via containers in AWS, where they interact with our Kafka clusters. The Kafka Streams API allows us to run each component, or group of components, in a distributed fashion across multiple containers depending on our scalability needs.

At a conceptual level, there are three main components of the application. The first two, the Domain Link Extractor and the Domain Reducer, are deployed together in a single JVM. The third component, the HITS Calculator and its associated API front end, is deployed as a separate JVM. In the diagram below, the curved bounding boxes represent deployment units; the rectangles represent the components.

Data, in the form of S3 URLs to stored web documents, comes into the system on an input topic. The Domain Link Extractor loads the document, extracts the links, and outputs a mapping from domain to associated domains, for that document. We’ll drill into this a little bit more below. At a high level, we use the flatMap KStream operation. We use flatMap rather than map to simplify the error handling—we’re not overly concerned with errors, so for each document we either emit one new stream element on success, or zero if there is an error. Using flatMap makes that straightforward.

These mappings then go to the Domain Reducer, where we use groupByKey and reduce to create a KTable. The key is the domain, and the value in the table is the union of all the domains that the key domain has linked to, across all the documents in that domain. From the KTable, we use toStream to convert back to a KStream and from there to an output topic, which is log-compacted.
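
A condensed sketch of a topology with that shape, written in Scala against the kafka-streams-scala DSL, is shown below; the topic names are assumptions and the extraction helper is a stub, so this illustrates the wiring rather than our actual code (log compaction itself is configured on the output topic, not in the topology):

```scala
import java.util.Properties

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object DomainReducerSketch extends App {
  // hypothetical helper: fetch the document behind an S3 URL and return (source domain, linked domains)
  def extractDomainLinks(s3Url: String): Option[(String, Set[String])] = None // stub

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "domain-reducer-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  builder
    .stream[String, String]("incoming-documents")       // values are S3 URLs (assumed topic name)
    .flatMap { (_, s3Url) =>                             // Domain Link Extractor:
      extractDomainLinks(s3Url)                          // one element on success, zero on failure
        .map { case (domain, linked) => domain -> linked.mkString(",") }
        .toList
    }
    .groupByKey                                          // Domain Reducer: union of linked domains per domain
    .reduce((a, b) => (a.split(",").toSet ++ b.split(",").toSet).mkString(","))
    .toStream                                            // KTable back to a KStream...
    .to("domain-links")                                  // ...and out to the (log-compacted) output topic

  new KafkaStreams(builder.build(), props).start()
}
```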

The final piece of the puzzle is the HITS Calculator. It reads in the updates to domains, keeps the mappings in a local cache, uses these mappings to create the adjacency matrix, and then performs the actual HITS calculation using the process described above. The ranked Hubs and Authorities are then made available via a REST API.

The flexibility of Kafka’s Streams API

Let’s dive into the Domain Link Extractor for a second, not to focus on the implementation, but as a means of exploring the flexibility that Kafka Streams gives.

The current implementation of the Domain Link Extractor component is a function that calls four more focused functions, tying everything together with a Scala for comprehension.  This all happens in a KStream flatMap call. Interestingly enough, the monadic style of the for comprehension fits in very naturally with the flatMap KStream call.
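
As a hedged illustration of that style (the four helpers below are hypothetical stand-ins for the real, more focused functions, composed here over Scala 2.12’s right-biased Either so that the first failure short-circuits the whole extraction):

```scala
object DomainExtractionSketch {
  // hypothetical, focused helpers; the real ones live inside the Domain Link Extractor
  def fetchDocument(s3Url: String): Either[String, String]        = Right("<html>...</html>")
  def extractLinks(html: String): Either[String, List[String]]    = Right(List("https://www.vogue.com/article"))
  def toDomains(links: List[String]): Either[String, Set[String]] = Right(Set("vogue.com"))
  def sourceDomain(s3Url: String): Either[String, String]         = Right("styleblog.example")

  // the for comprehension composes the steps; the enclosing flatMap emits
  // one element on success and zero on failure
  def extract(s3Url: String): Iterable[(String, Set[String])] =
    (for {
      html    <- fetchDocument(s3Url)
      links   <- extractLinks(html)
      domains <- toDomains(links)
      source  <- sourceDomain(s3Url)
    } yield source -> domains).toOption.toList
}
```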


One of the nice things about working with Kafka Streams is the flexibility that it gives us. For example, if we wanted to add information to the extracted external links, it is very straightforward to capture the intermediate output and further process that data, without interfering with the existing calculation (shown below).


In Summary

For us, Apache Kafka is useful for much more than just collecting and sharing data in real-time; we use it to solve important problems in our business domain by building applications on top. Specifically, Kafka’s Streams API enables us to build real-time Scala applications that are easy to implement, elastically scalable, and that fit well into our existing, AWS-based deployment setup.  

The programming style fits in very naturally with our already existing functional approach, allowing us to quickly and easily tackle problems with a natural decomposition of the problem into flexible and scalable microservices.

Given that much of what we’re doing is manipulating and transforming data, we get to stay close to the data. Kafka Streams doesn’t force us to work at too abstract a level or distract us with unnecessary infrastructure concerns.

It’s for these reasons we use Apache Kafka’s Streams API in the Dublin Fashion Insight Centre as one of our go-to tools in building up our understanding of the Fashion Web.

About Apache Kafka’s Streams API

If you have enjoyed this article, you might want to continue with the following resources to learn more about Apache Kafka’s Streams API:

Interested in working in our team? Get in touch: we’re hiring.

Zalando Tech Blog – Technical Articles Conor Clifford

Zalando is using an event-driven approach for its new Fashion Platform. Conor Clifford examines why

In a recent post, I wrote about how we went about building the core Article services and applications of Zalando’s new Fashion Platform with a strong event-first focus. That new platform also has a strong overall event-driven focus, rather than a more “traditional” service-oriented approach.

The concept of “event-driven” is not a new one; indeed, it has been quite well covered in recent years.

In this post, we look at why we are using an event-driven approach to build the new Fashion Platform in Zalando.

Serving Complexity

A “traditional” service/microservice architecture will be composed of many individual services, each with different responsibilities. Each service will likely have several, probably many, clients; each interacting with the service to fetch data as needed. And these clients may be services to other clients, etc.

Various clients will have different requirements for the data they are fetching, for example:

  • Regular high frequency individual primary key fetches
  • Intermittent, yet regular, large bulk fetches
  • Non-primary key based queries/searches, also with varieties of frequencies and volumes
  • All the above with differing expectations/requirements around response times, and request throughputs

Over time, as such systems grow and become more complex, the demands on each of these services grow, both in terms of new functional requirements, as well as operational demands. From generally easy beginnings, growth can lead to ever-increasing complexity of the overall system, resulting in systems that are difficult to operate, scale, maintain and improve over time.

While there are excellent tools and techniques for dealing with and managing these complexities, these target the symptoms, not the underlying root causes.

Perhaps there is a better way.

Want to know more about Zalando Dublin? Check out the video straight from our Fashion Insights Centre.

Inversion of flow

The basic underlying concept here is to invert this traditional flow of information: to change from a top-down, request-oriented system to one where data flows from the bottom up, with changes to data causing new snapshot events to be published. These changes propagate upwards through the system, being handled appropriately by a variety of client subsystems along the way.

Rather than fetching data on demand, clients requiring the data in question can process it appropriately for their own needs, at their own pace. That can mean transforming, merging and producing new events, or building an appropriate local persisted projection of the data, e.g. a high-speed key-value store for fast lookups, populating an analytical database, maintaining a search cluster, or even maintaining a corpus of data for various data science/machine learning activities. In fact, there can and will be clients that do a combination of such activities around the event data.
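
As one hedged sketch of the simplest form such a local projection can take, the loop below tails a snapshot topic with a plain Kafka consumer and keeps the latest value per key in memory; a real client would usually write into a persisted key-value store instead, and the topic and group names here are made up:

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

import scala.collection.JavaConverters._
import scala.collection.mutable

object LocalProjectionSketch extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "article-projection-sketch")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest") // replay the topic from the beginning

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("article-snapshots")) // assumed snapshot/event topic

  // the local projection: latest snapshot per key, ready for fast lookups
  val projection = mutable.Map.empty[String, String]

  while (true) {
    consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
      projection.put(record.key, record.value)
    }
  }
}
```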

On-Demand Requests Are Easy

Building a client that pulls data on demand from a service would appear the easier thing to do, with clients being free to just fetch data directly as needed. From the client perspective, it can even appear that there is no obvious benefit to an event-driven approach.

However, with a view to the wider platform ecology (many clients, many services, lots of data, etc.), the traditional “pull-based” approach leads to much more complex and problematic systems, with a variety of challenges:

  • Operation and Maintenance - core services in pull-based systems grow to serve more and more clients over time; clients with different access requirements (PK fetches, batch fetches, periodic "fetch the world" cycles, etc.). As the number and types of such clients grow, operating and maintaining such core services becomes ever more complex and difficult.
  • Software Delivery - as clients of core services grow, so too will the list of requirements around different access patterns and capabilities of the underlying data services (e.g. inclusion of batch-fetches, alternative indexing, growing request loads, competing prioritizations, etc.). This workload has a strong tendency to ultimately swamp the delivery teams of core services, to the detriment of delivering new business value. In addition to the service's team, the client teams themselves would also be dependent on new/changed functionality in the services to allow them to move forward.
  • Runtime Complexity - Outages and other such incidents in "pull" based environments can have dramatic impacts. A core service outage would essentially break any client fetching data "on demand". Multiple dependent applications can be brought down by an outage in a single underlying service. There can also be interesting dynamics on recovery of such services, with potential thundering herds, etc., causing repeating outages during this recovery, prolonging, or worse, further degrading, the impact of the original incident. Even without outages, the complexity of systems built around a request/response approach makes forecasting and predicting load growth difficult: modelling the interplay of many different clients with different request patterns is hard, and attempting to forecast growth for each of them becomes a real challenge.

By evolving to an event-driven system, there are many advantages over these, and other aspects:

  • Operationally - since clients receive information about change, they can react instantly and appropriately themselves. As the throughput of data is driving the system, the performance/load characteristics are much more predictable (and testable). There is no non-determinism caused by the interplay of multiple clients, etc. In general, sharing data over event streams allows for much looser coupling of clients and services, resulting in simpler systems.
  • Delivery - With the ability to access complete data from the event streams, clients are no longer blocked by the service teams' delivery backlog; they are completely free to move forward themselves. Similarly, service delivery team backlogs are not overloaded by requests for serving modifications/alterations, etc., and are thus freed up to directly deliver new business value.
  • Outages - With clients receiving data changes, and handling these locally, an outage of the originating service essentially means clients working with some stale data (the data that would have been changed during that outage), typically a much less invasive and problematic issue. In many cases, where clients depend on data that changes infrequently, if at all, once established it’s not an issue.
  • Greater Runtime Simplicity - with data passing through event streams, and clients consuming these streams as they need, the overall dynamic of the system becomes more predictable/less complicated.

Tradeoffs

“Time is an illusion. Lunchtime doubly so.” - Douglas Adams

There's no such thing as a free lunch. There’s likely more work up front in establishing such an architectural shift, as well as other concerns:

  • Client Burden - There will be an additional burden on clients in a purely event-driven system, with those clients having to implement and operate local persisted projections of the event stream(s) for their purposes. However, a non-trivial part of this extra work is offset by removing work (development and operational) around integrating with a synchronous API and all the details that entails; dealing with authentication, rate limiting, circuit breakers, outage mitigations, etc. There is also less work involved with not having an API that is 100% purpose built. In addition, the logic to maintain such local snapshot projections is straightforward (e.g. write an optionally transformed value to a “key value” store for fast local lookups).
  • Source of Truth - A common concern with having locally managed projections of the source data is that there is a loss of the single source of truth that the originating service represents in the traditional model. By following an “Event First” approach, with the event stream itself being the primary source of truth, and by allowing only changes from the event stream itself to cause changes to any of the local projections, the source of truth is kept true.
  • Traditional Clients - there may be clients that are not in a position to deal with a local data projection (e.g. clients that only require a few requests processed, clients that facilitate certain types of custom/external integrations, etc.) In such cases, there may be a need to provide a more traditional “request-response” interface. These, though, can be built using the same foundations, i.e. a custom data projection, and a dedicated new service using this to address these clients’ needs. We need to ensure that any clients looking to fit the “traditional” model are appropriate candidates to do so. Care should be taken to resist the temptation to implement the “easy” solution, rather than the correct solution.

Conclusion

In the modern era of building growing systems, with many hundreds of engineers and dozens of teams, all trying to move fast and deliver excellent software with key business value, there is a need for less fragile solutions.

In this post, we have looked at moving away from a more regular “service” oriented architecture, towards one driven by event streams. There are many advantages, but they come with their own set of challenges. And of course, there is no such thing as a silver bullet.

In the next post, I will look at some of the lessons we have learned building the system, and some practices that should be encouraged.

If you are interested in working with these types of systems and challenges, join us. We’re hiring!

Zalando Tech Blog – Technical Articles Vadym Kukhtin

Two brothers examine the pros and cons of UI testing

Based on their different experiences in Partner-Solutions and Zalando Media Solutions respectively, we speak to front-end developers Vadym Kukhtin and Aleks Kukhtin about their opposing opinions on UI testing.

The Case Against UI Testing - Vadym

TL;DR It depends on preference, but I believe that UI testing isn’t required in every instance

In my experience, it is a sisyphean task to force developers to write even basic unit tests, never mind UI and E2E tests. Only Spartans led by Leonidas can achieve UI and E2E testing.

Of course, the case for UI testing is more complex than a simple “good vs bad” dichotomy. For example, the scale and scope of the app should be taken into account. If the app is small or short-term, most probably UI tests aren’t required. If it’s a monster project that needs to be covered as much as possible, then unit and E2E tests are required.

In a real-world app, any interaction should change some state of the app, whether it’s a click, hover or any custom event. With unit tests, the developer can test internal component or service functionality, and with E2E tests the developer can test common component interactions, connections to third-party services, and API and backend functionality.

Example:

  • Use case: User should be able to login using OAuth and see “Hello” board.
  • App: LoginComponent:
  • Test case:

This process can be incredibly time consuming, with developers spending time writing the tests and mocking all dependencies. For small or short-term apps, we have to ask ourselves: is the time worth it?


My answer would be no.

The Case for UI Testing - Aleks

TL;DR In most cases I think we don’t need to write UI tests.

Let’s start with a small illustration. You start a new project and a month later have a nicely working app. You then decide to change one component in your structure. It is only a UI component, so you know that the logic hasn’t changed. The change itself works, but now the app has some errors: it looks like you forgot about some style changes and your component now looks awkward. So, you fix the styles, deploy the changes, sit back and relax. But now, you have unwanted style issues in another component. So you again do the same thing and deploy the changes. Ideally, UI tests can identify this kind of problem.

Like most things in life, UI tests have advantages and disadvantages. Some argue they take too much time, but this time can be considered an investment, safeguarding against any unwanted games of “tennis” as seen in the example above. Testing the UI helps us better understand our code, and what it should actually render.

Yes, testing is complicated. Complex UI logic is pretty hard to test, but not inconceivable. The problem is that the process brings a lot of trouble for the advantages gained: you need to write a lot of tests, but as a result even small changes (which happen often with UI) force you to rewrite a big part of them.

In Conclusion
The biggest takeaway from our discussion is that UI testing cannot be simply filed away under “good” or “bad”. In some circumstances – such as small apps or short-term projects – testing may not be the best use of time. In others, testing is a must for maintaining the integrity of the components and saving time in the future.

Got an opinion on UI testing and want to bring it to a dynamic team? Get in touch. We’re hiring.

Zalando Tech Blog – Technical Articles Andrey Dyachkov

At Zalando we’ve created Nakadi, a distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues. It helps to provide an available, durable, and fault tolerant publish/subscribe messaging system for simple microservices communication.

A Kafka cluster can grow to hold a huge amount of data on its disks. Hosting Kafka requires support for instance termination (on purpose, or just because the “cloud provider” decided to terminate the instance), which in our case introduces a node with no data: the whole cluster has to be rebalanced in order to evenly distribute the data among the nodes, taking hours of data copying. Here, we are going to talk about how we avoided rebalancing after node termination in a Kafka cluster hosted on AWS.

In the beginning, at least when I joined, our Kafka cluster had the following configuration:

  • 5 Kafka brokers: m3.medium 2TB SSD
  • Replication factor 3 and min insync replicas 2
  • 3 Zookeeper nodes: m3.medium
  • Ingest 250GB per day

Nowadays, the cluster is much bigger:

  • 15 Kafka brokers: m4.2xlarge 8TB HDD
  • Replication factor 3 and min insync replicas 2
  • 3 Zookeeper nodes: i3.large
  • Ingest 5TB per day and egress 30TB per day

Problem

The above setup results in a number of problems that we are looking to solve, such as:

Loss of instance

AWS is able to terminate or restart your instance without notifying you in advance. Once that has happened, the broker has lost its data, which means it has to copy the log from the other brokers. This is a pain point, because it takes hours to accomplish.

Changing instance type

The load is growing and at some point in time the decision is made to upgrade the AWS instance type to sustain the load. This could be a major issue in terms of time as well as availability. It correlates with the first issue, but is a different scenario of losing broker data.

Upgrading a Docker image

Zalando has to follow certain compliance guidelines, which is supported by using services like Senza and Taupage. These, in turn, have their own requirements, one of which is immutable Docker images that cannot be replaced once the instance is running. To overcome this, one has to relaunch the instance, hence copying a lot of data from other Kafka brokers.

Kafka cluster upgrade

Upgrading your Kafka version (or maybe downgrading it) requires building a different image which holds new parameters for downloading that Kafka version. This again requires the termination of the instance, involving data copying.

When the cluster is quite small, it is pretty fast to rebalance it, taking around 10-20 minutes. However, the bigger the cluster, the longer it takes to rebalance. A rebalance of our current cluster has taken about 7 hours when one broker is down.

Solution

Our Kafka brokers were already using attached EBS volumes; an EBS volume is an additional volume, located somewhere in the AWS data center, connected to the instance via the network in order to provide durability, availability and more disk space. The AWS documentation states: "Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud."

The only tiny issue here is that instance termination would bring the EBS volume down together with the instance, introducing data loss for one of the brokers. The figure below shows how it was organized:

The solution we found was to detach the EBS volume from the machine before terminating the instance and attach it to the new running instance. There is one small detail here: it is better to terminate Kafka gracefully, in order to decrease the startup time. If you terminate Kafka in a “dirty” way without stopping it, it will rebuild the log index from the start, requiring a lot of time depending on how much data is stored on the broker.

In the diagram above we see that the instance termination does not touch any EBS volume, because it was safely detached from the instance.

Losing a broker without detaching its EBS volume (terminating it together with the instance) introduces under-replicated partitions on the other brokers holding the same partitions. To recover from that state, the rebalance takes around 6-7 hours. If, during the rebalance, other brokers holding the same partitions go down, it will provoke offline partitions and it will no longer be possible to publish to them. It is better not to lose any other broker.

Reattaching EBS volumes can be done by hand in the AWS Console, but to be honest I have never done it myself. Our team went about automating it with Python scripts and the Boto 3 library from AWS.

The scripts are able to do the following:

  • Create a broker with attached EBS volume
  • Create a broker and attach an existing EBS volume  
  • Terminate a broker, detaching the EBS volume beforehand
  • Upgrade a Docker image reusing the same EBS volume

Kafka instance control scripts can be found in our GitHub account, where the usage is described. Basically, these are one-line commands which consume the configuration for the cluster so that parameters don't have to be passed to the scripts. Remember, we use Senza and Taupage, so the scripts are a bit Zalando specific, but can be changed quite quickly with very little effort.
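
To give a feel for what the automation does, here is a hedged sketch of the detach/attach step using the AWS SDK for Java from Scala; our actual scripts use Python and Boto 3, and the identifiers below are made up:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{AttachVolumeRequest, DetachVolumeRequest}

object ReattachEbsSketch extends App {
  val ec2 = AmazonEC2ClientBuilder.defaultClient()

  // hypothetical identifiers
  val volumeId      = "vol-0123456789abcdef0"
  val newInstanceId = "i-0123456789abcdef0"

  // 1. stop Kafka gracefully on the old broker (not shown), then detach its data volume;
  //    a real script would poll until the volume reports as "available"
  ec2.detachVolume(new DetachVolumeRequest().withVolumeId(volumeId))

  // 2. terminate the old instance and launch its replacement (not shown),
  //    then attach the very same volume to the new broker
  ec2.attachVolume(
    new AttachVolumeRequest()
      .withVolumeId(volumeId)
      .withInstanceId(newInstanceId)
      .withDevice("/dev/xvdf"))
}
```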

However, it’s important to note that the instance could have a kernel panic or some hardware issue while running the broker. AWS Auto Recovery helps to address this kind of issue. In simple terms, it is a feature of the EC2 instance that allows it to recover after a network connectivity, hardware or software failure. What does recovery mean here? The instance is rebooted after the failure, preserving many parameters from the impaired instance, among them EBS volume attachments. This is exactly what we need!

In order to turn it on, just create a CloudWatch alarm for StatusCheckFailed_System and you are all set. The next time the instance has a failure scenario it will be rebooted, preserving the attached EBS volume, which helps to avoid data copying.

Impact

Our team no longer worries about losing a Kafka broker, as it can be recovered in a matter of minutes without copying data or wasting money on traffic. It only takes 2 hours to upgrade 15 nodes of a Kafka cluster, which happens to be 42x faster than our previous approach.

In the future, we plan to add this functionality directly to our Kafka supervisor, which will allow us to completely automate our Kafka cluster upgrades and failure scenarios.

Have any feedback or questions? Find me on Twitter at @a_dyachkov.

Zalando Tech Blog – Technical Articles Alberto Alvarez

Fashion meets tech in our Dublin hub


At the Fashion Insights Centre in Dublin, one of the core tech products being developed is the Smart Product Platform (SPP). The fashion products we sell are the fundamental building blocks of what we do as a business. How to manage and represent these products and their associated data in today's competitive fashion marketplace is challenging. Fashion is everywhere, at once global and local, something that helps us identify with others but also something deeply personal. A thorough understanding of these products and their associated data is vital in delivering an engaging customer experience.


Fashion Data in My Language

Shopping, as we know, can be a deeply frustrating experience. Options run in spectrums from formal to casual, head to toe, expensive to affordable, and many more. So how can customers articulate what they want and see items or looks that are relevant to them? In essence, how can people truly dress well?

Perhaps you are a dedicated follower of fashion and understand fashion in terms of the latest designers, trends and styles that may be appearing this season on the catwalks of London, New York or Milan. Or maybe, while you like to keep up with the latest trends, you don’t understand fashion in the language of the catwalk. You just know what you like. Or perhaps you’re not even sure what you like at the moment; you’re just looking for something new and are seeking inspiration.

It all starts with the product data, which is where SPP comes in. How we understand fashion, how we describe it, the words we use, the language we speak, and the things that matter to us are all personal. Likewise, while we may all share a common understanding of themes, the approach to finding the fashion we want to wear, whatever the motivation or occasion may be, will be different from person to person. A few examples:

  • Physical characteristics - What category of product am I looking for: a dress, jeans, shoes? What color is this? I know what color I like but can I describe it? What about the material? What is it made of? Is brand important to me?
  • Fashion information - Am I shopping for a trend or style? What occasion am I shopping for? Do I care if it is sustainably manufactured? I saw celebrity ‘X’ wearing a fantastic dress. I wonder where I can get that?

To connect with our customers we truly have to match our products to human thinking. We must understand customers’ stories, and encourage and enable a transformation from low confidence ‘please inspire me’ encounters to truly materialized fashion confidence.

To match the individual with fashion, we need data. Article data. Fashion product data. Data that helps describe and understand the fashion we sell.


Smart Products

Smart Product Platform (SPP) is Zalando’s new product platform which enables the collection, management and exposure of fashion product data across the organization, at scale. SPP is a bespoke combination of Product Information Management (PIM) and Master Data Management (MDM) systems engineered 100% in-house to satisfy Zalando’s current and future product management requirements.

We intend to increase the amount of contextual product fashion data we store and do this at scale in order to help us, and our customers understand our products better. Consumers of this data can enhance their business use cases by deriving value and insights from this product data, as well as enriching it with new information. SPP provides a foundational infrastructure to store, organize and deliver product information from multiple sources.

Our considerations during the initial product and technology phases were:

  • Scalability - We’re facing exponential increases in product data (both in terms of the number of products in our assortment and in terms of the breadth of product data we store), so we need to ensure we can scale, both production and consumption wise.
  • Flexibility - We need an underlying hyper-flexible data model to ensure we can meet our customer needs quickly when it comes to managing and modeling data to support new business use cases.
  • Quality - Validating product data and ensuring it is complete, correct and consistent. This is done in conjunction with our Fashion Librarians' (data stewards') curation guidelines.
  • Accessibility - Ease of data onboarding, and data exposure and consumption are cornerstones of SPP.
  • Next generation architecture - We built all the underlying infrastructure on a microservice architecture connected via event-driven streaming interfaces to enable data distribution to our consumers.

Product Platform

SPP’s purpose is founded around its core value unit: the product. As an internal data platform, we connect multiple data producers (supply side) with product consumers (demand side), looking for ways to maximize their offerings and product value.

To meet the needs of our data suppliers (currently ranging from our wholesale organization to our partner brands and merchants), our data consumers (our fashion stores and other consumer-facing applications) and, most importantly, our customers, we need to ensure that our data ingestion and data update processes are effective and efficient.

As well as ensuring data suppliers can onboard data easily, we need to ensure the same ease of access for all our data consumers. With millions of products and thousands of attributes, we offer rich data sets for other (internal and external) value added services such as product recommendations, personalization, advertising, logistics and so on.

More and more, it’s the product and associated data that is becoming a key source of differentiation.

Discover, Understand, Decide

While accurate and consistent metadata about physical attributes (What size is it? What color is it? What is it made of?) are vital to helping us understand a product, it is the effective enriching or augmenting of our product data with additional information over and above these traditional physical product attributes that truly provides business value: more conversions, higher customer engagement and a better customer experience.

When we talk about enrichment, we mean attaching associated relevant content to products:

  • Information (tags, descriptions) - What trend or style is it part of? What occasions would you wear it for? What celebrity/influencer is wearing it? Where is it being worn? Where is it popular? And so on.
  • Media (images, video, AR/VR) - Images are at the heart of the shopping experience. Video is increasingly important. Augmented and Virtual Reality are considerations also.
  • Content -  Links, editorial content, social media.

This content could be implicit as is the case with, say, personalization, or explicit e.g. discovery.

Thanks to SPP, we can now effectively store and expose this additional fashion context. With increases in additional fashion knowledge we can:

  • Help customers find the products they are searching for
  • Improve our customers’ understanding of our products
  • Allow our customers to make more informed purchase decisions

Fashion is a journey and an experience, both inspirational and aspirational. A journey where data plays a key role.


Data, Data, Data

Considering the dynamic nature of the fashion world, the prominence of fast fashion and new avenues of discovery and promotion (e.g. social media, fashion influencers) it is clear that both the breadth and depth of (consistent and accurate) product data and fast time-to-market are key drivers for the success of any fashion product platform.

This additional data, or fashion context, lays the foundation to establish the product, the fashion article, as a new source of competitive advantage. The more relevant data we add, the better our customers can understand our products and ultimately make more informed buying decisions. We want to be there for our customers for every occasion, present and future: in discovery and product understanding, and in leveraging data that makes dressing well more achievable for everyone.

Got something to say about data? Come join us. We're hiring.

Zalando Tech Blog – Technical Articles Paulo Renato Campos de Siqueira

Pull Requests (PRs) are the norm today when it comes to common software development practices in teams. It is the right way to submit code changes so that your peers can check them out, add in their thoughts and help you create the best code you can - i.e. PRs allow us to easily introduce code review to our development process and enable a great deal of teamwork, while also decreasing the number of bugs our software contains.

There are several aspects we can talk about when it comes to Pull Requests and code review. In this post, I'm specifically concerned with the size of PRs, although I'll briefly touch other points as well. Other dimensions you could think about include having a good description of what is being done and why, and being sure that the Pull Request only changes one thing and one thing only, i.e. it is independent and self-contained.

On a personal note, I think Pull Requests nowadays are so important that I even use them on projects where I work alone so that I can have automated checks applied before deciding to merge into master. It allows me that extra opportunity to catch errors before it is too late. In GitHub for example, this generates a nice visual summary of the checks performed. And yes, you could do this straight into your branches, but using PRs is easier and more organized. You could for example easily decline and close a PR, and document why you did it. In this PR in one of my pet projects for example, you can see Codacy, Travis CI and CodeCov checking my code before I merge it to master.

Having said the above, it is way too easy to get carried away when developing and you may end up adding several small things at once - be it features or fixes, or simply some refactoring in the same PR - thus making it quite large and hard to read. And don't get me wrong: crafting small, self-contained and useful Pull Requests is not easy! Good developers don't create big PRs because they are lazy: sometimes it is hard to see the value in going the extra mile to break up what has already become way too big.

Another aspect to consider here is related to git commit good practices in general. Having small Pull Requests will also help to have small and focused individual commits, which is very valuable when maintaining code. Let’s illustrate this point with an example that happened recently to me.

Can I revert this?

I was investigating a bug, something that used to work well and that simply stopped working out of the blue. After some time and investigation, I found that the relevant code was simply removed, and that we didn't notice it beforehand because of yet another bug. Obvious solution: go through the git history and just git revert the deleted code. Except that I couldn't find any commit related to it.

After further investigation I finally found the commit that removed the files - but it was a commit that also did several other unrelated things. git revert was no longer an option, especially due to the rest of the code that had been changed at this point, and I ended up having to manually add the files myself. The total time spent with this became way more than it could have been.

Why are big Pull Requests a problem?

The first and most important thing to note here is our human capacity to hold knowledge in one's head. There is a limit to how much information you can keep and correlate at once, while at the same time weighing all its consequences in the rest of the system, or even for external / client systems. This will obviously be different for different people, but is a problem at some level. And when working in a team, you have to lower this bar to make sure everyone can work at the same level.

When you are reviewing a Pull Request, you have to keep some things in mind, such as:

  • What are the new components being created?
  • How do they interact with existing components?
  • Is there code being deleted? If so, should it really be deleted?
  • Are the new components really necessary? Perhaps you already have something in the current code base that solves the problem? Or something that could be generalized and applied to both places?
  • Do you see new bugs being introduced?
  • Is the general design OK? Is it consistent with the rest of the project's design?

There are quite a few things to check. How much of that can you keep in your mind while you are reviewing code? This gets harder the more lines of code there are to be checked.

So back to small PRs. While all of this has little to no impact on automated checks and builds, it can have a huge impact when it comes to code review. With that in mind, let's go through a few ideas you can use to escape the situation where you don't really feel like breaking your PR into smaller pieces - but should nonetheless. There is no black magic here; we will just use some nice git commands in a way that helps us achieve our goal. Perhaps you will even discover a few things you didn't know before!

Sort your imports

I prefer sorting imports in alphabetical order, but the actual criterion doesn't matter, as long as the whole team uses the same one. This practice can easily be automated and avoids generating a diff when two developers add the same import in different positions. It also completely eliminates duplicated imports generated by merges.

Sometimes this will also avoid conflicts where two developers remove or add unrelated imports in the same position and git doesn't know what to do about them. Sorting imports makes them naturally mergeable.

Avoid frequent formatting changes

This happens a lot, especially if you don't use code formatting tools like scalafmt or scalariform (or whatever is available for your language of choice). Sometimes you see a blank line you don't like, or you miss a blank line where you believe one should be. You simply go ahead and delete or add it. That's yet another line change that goes into your PR.

And this is not only about PR size. Such a small change has a big chance of creating conflicts if you ever have to update your PR before merging. Another developer might legitimately change the same piece of code, and now you have to check very carefully whether a change was purely cosmetic and can be ignored, or whether there was something real to consider. More than once I've seen features simply vanish because of this kind of thing.

If you really want to make some formatting changes, do so, but send them as a separate PR that can be merged as soon as possible, independently of any feature. And consider automating this task as well.

Allow reviewers the time to review

This is a little meta, but important nonetheless: resist the urge to have your code merged right away. I suffer from this myself from time to time, especially with very small PRs. Still, reviewers should be given time to work. If you did a good job of making the PR small and self-contained, and added a good description to its body, you will likely get speedy feedback anyway.

To better explain this it is worth quoting a teammate, who once said: “Sometimes it feels like we are asking for thumbs, not for reviews.”

If you sense something like this is happening, you should stop. You are probably rushing the review process, which will only result in stress and badly reviewed code. My rule of thumb is to not ask for a thumbs up, quite literally: every time I catch myself doing so, I stop and rephrase, asking for a review instead.

Advanced and powerful: manipulating your sources with Git

Now for the more complex (and perhaps more interesting) practices. What follows requires at least an intermediate understanding of git, and that you are not afraid of git rebase. I say this because most of us are afraid of it when we first start learning git; that fear is only temporary, though, and goes away once you realize the power rebase gives you.

Let's now consider the following scenario. You are working on a feature and suddenly notice that some kind of side change is required - something not strictly related to the feature itself, but which would be of great help for your task. You might then get the urge to simply go ahead and do it, together with your current feature code.

Side changes with Git Stash

See the problem already? If you simply do it, the PR for your feature will get bigger. It will also now contain one (or more) extra concerns, which the reviewers will have to verify as well.

Instead, you should send this side change as a new PR. There are a few different ways to do this properly with git, but the easiest is to use git stash. What it does is set your current changes aside and give you a clean workspace. You can then switch to a new branch, implement the side change, and submit the PR.
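As a minimal sketch of that flow - the branch name and commit message here are made up purely for illustration:

  git stash                        # set aside the in-progress feature work
  git checkout master
  git pull                         # start the side change from an up-to-date master
  git checkout -b side-change
  # ...implement the side change, then:
  git add .
  git commit -m "Extract shared helper"
  git push -u origin side-change   # open the PR for the side change from this branch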

With that, your teammates can start reviewing these changes immediately, while you are still working on the feature itself. Moreover, they will also be able to leverage those changes in their own code - who said these changes would be useful only to you? And finally, it gives your colleagues the opportunity to point out problems sooner rather than later. Perhaps something is incompatible with someone else's work, or another developer had just started making the same kind of changes and now doesn't have to. You can work together to achieve an even better result. Not to mention that this should be a small PR, and thus quite easy to review.

After the PR is sent, you can switch back to your feature branch and recover your work with git stash pop, picking up where you left off. There is, however, yet another problem: how do you deal with the fact that your side changes are probably not merged yet?

First, the problem is not that big in principle. The side changes live in their own commits, so your main changes remain completely isolated. If at any time you get feedback and have to update the PR you just sent, you can always stash your current changes again. See the git stash documentation for more information on how this works.

Second, it might be that your PR with the side changes is simply accepted as is and merged. In this case, it is quite easy to get your feature branch up to date: a git rebase master (or whatever branch your team merges to) should do the trick. This is probably the easiest (and safest) variation of git rebase you can use. See the git rebase documentation here.
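Assuming your team merges to master, updating the feature branch could look something like this (the feature branch name is again just an example):

  git checkout master
  git pull                   # bring master up to date with the merged side changes
  git checkout my-feature
  git rebase master          # replay your feature commits on top of the new master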

Finally, some pointers for the most complex case. You may find that you have to fix many things in your side-changes PR, and at this point you may also have already made a few commits towards the feature you are implementing. You can use your imagination here and a nifty combination of git features to solve your problem. For example, you could try the following steps:

  • Wait for the side changes PR to be merged to master
  • Update your master: git pull
  • Create a new branch, based on master: git checkout -b my-new-branch
  • Go to your feature branch and carefully use git log to find which commits you used for the feature
  • Go back to the new branch
  • Use git cherry-pick to move over the commits you found with git log

See the git cherry-pick documentation here. Notice that you can also cherry-pick a range of commits, instead of picking them one by one, if you prefer. You can also build on top of the commit you already sent as a new PR, perhaps in a temporary branch where you add your feature code on top of it.
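To make these steps concrete, here is a rough sketch - the branch names and commit hashes are placeholders:

  git checkout master
  git pull                          # the side-changes PR has already been merged here
  git checkout -b my-feature-v2     # new branch based on the updated master
  git log my-feature                # find the feature commits on the old branch
  git cherry-pick abc1234           # bring over a single commit...
  git cherry-pick abc1234^..def5678 # ...or, alternatively, a whole range (oldest^..newest)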

As you can see, git is a very powerful tool and offers you many ways to solve your problems.

Splitting up code into multiple PRs

The next scenario is that moment when you've already gotten too excited with your code, couldn't stop, and ended up with a huge pile of changes to throw at your peers' heads. In this case, it can be very easy to simply say something like:

“Sorry for the big PR. I could split it into smaller pieces but it would take too long.”

Let's go through some ideas to avoid this scenario by applying a little effort and splitting up your work.

First off, if you have well-crafted individual commits, they can easily be turned into PRs with git cherry-pick. You simply write down which commits you want to submit as new PRs, move to a new branch, and bring those commits over with git cherry-pick. You can combine this with git stash to make it easier to deal with uncommitted code, as described above.

One small drawback is that sometimes your changes depend on each other, and you might have to wait for the first PR to be merged before you can really send the second one. On the other hand, if the first one is small, chances are it will be approved quickly, as already mentioned.

The whole process might not be too pleasant for you at first, but will definitely help the rest of the team. A small tip that might sound obvious is to "pre-wire" your PRs: go to your peers and let them know that those PRs are coming and what they are about. This will help them review your code faster.

A note about failure

It might all be beautiful on paper, but in reality this is not always possible. Even if you follow the tips presented here, you may still end up with big PRs from time to time. The critical point is that, when this happens, it should:

  • Be a conscious decision, not an accident;
  • Be as small as possible, i.e., you applied at least some of the tips above;
  • Be an exception, not the rule.

Remember: this is all about teamwork. Some of these practices might make you a little slower, especially until you get into the right frame of mind, but they will make the whole team faster in the long run and increase the chances of bugs being caught during code review. A final plus is that knowledge sharing will also improve, since there is less to learn in each PR, and team members can ask more questions without being afraid of turning the review process into an endless discussion.

If you have read everything up until this point, then perhaps you are interested in reading even more. Here are some further interesting references around the subject:

What do you think? Do you have other techniques that you think could help in creating small and effective PRs? Or do you disagree that this is necessary? Let me know via Twitter at @jcranky.

Zalando Tech Blog – Technical Articles Conor Clifford

A Challenge

Shortly after joining Zalando, I, along with a small number of other new colleagues (in a newly opened Dublin office), was entrusted with building an important part of the new Fashion Platform - in particular, the core services around Zalando's Article data. This task came with several interesting challenges, not least of which was ensuring the new platform provided not just sufficient capacity and throughput for existing workloads, but also room for longer-term growth - not just in data volumes and throughput, but also in the number and types of users of that data. The aim here was the democratization of data for all potential users on the new platform.

Decision Time

It was decided that this new platform would be primarily an event driven system - with data changes being streamed to consumers. These consumers would subscribe to, receive, and process the data appropriately for their own needs - essentially inverting the flow of data from the traditional “pull” based architectures to a “push” based approach. With this, we were looking to strongly encourage wide adoption of a “third generation microservices” architecture.

In an event driven system, the outbound events themselves are at least as important as the data being managed by the system. The primary responsibility of the system is not just to manage the data, but also to ensure a fully correct and efficient outbound event stream, as it is this event stream that is the primary source of data for the majority of clients of this system.

Starting with an API First approach, the event structure and definitions were treated as much a part of the system’s API as the more traditional HTTP API being designed. Beyond just the definition of the events (as part of the API), key focus was placed both on the correctness of the events (with respect to the stored data and the sequence of changes made to that data) and on the efficient publishing of the event stream. This Event First approach meant that any design or implementation decision was always taken with the correctness and efficiency of the outbound event stream as the primary focus.

Initially, we built a quick prototype of the data services - primitive CRUD-type services, with synchronous HTTP APIs, each interacting directly with a simple (dedicated) PostgreSQL database as the operational store for the data. Outbound events were generated after completion of DB updates.

For this prototype, a very simple HTTP-based mockup of an event delivery system was used, while we decided on the real eventing infrastructure that would be used.

Not only did this prototype allow us to quickly exercise the APIs (in particular the event definitions) as they were being constructed, it also allowed us to identify several shortfalls of this type of synchronous service model, including:

  • Dealing with multiple networked systems, especially around ensuring correct delivery of outbound events for every completed data change
  • Ensuring concurrent modifications to the same data entities are correctly sequenced, guaranteeing correctly ordered delivery of the outbound events
  • Effectively supporting a variety of data-providing client types, from live low-latency clients through to high-volume bulk clients.

Throw away your prototypes

With these limitations in mind, we worked on moving from this synchronous service approach to an asynchronous one, processing data using an Event Sourcing model. At the same time, we progressed in our selection of an eventing platform and were looking strongly at Apache Kafka: its combination of high throughput, guaranteed ordering, at-least-once delivery semantics, strong partitioning, natural backpressure handling, and log compaction made it a winning choice for the outbound events.

With Kafka selected as the outbound event platform, it was also a natural choice for the inbound data processing. Using Kafka as the inbound event source, the logic for processing the data became a relatively simple event processing engine, and much of the feature set that was valuable for outbound event processing was equally valuable for the inbound processing:

  • High throughput allowing for fast data ingestion - HTTP submissions are transformed into inbound events published to an internal topic - even with high acknowledgement settings for publishing these events, submission times are generally in the order of single-digit milliseconds per submitted event. By giving clients fast, guaranteed acceptance responses for the data they submit, they can safely proceed through their workload promptly - allowing for a greater flow of information through the wider system in general.
  • Guaranteed ordering - moving to event processing on a guaranteed ordered topic removed a lot of complexity around concurrent modifications, as well as cross-entity validations, etc.
  • At least once delivery - with any network-oriented service, modelling data changes to be idempotent is an important best practice: it allows reprocessing the same request/event (in case of retries or, with at-least-once delivery, repeated delivery). Having this semantic in place for both the inbound event source and the outbound event topic allowed the event processing logic to use coarse-grained retries around various activities (e.g. database manipulations, accessing remote validation engines, audit trail generation, and of course outbound event delivery). Removing the need for complex transaction handling allowed for much simpler logic and, as such, higher throughput in the nominal case.
  • Natural backpressure handling - with Kafka’s “pull” based semantics, clients process data at their own rate - there are no complex feedback/throttling interactions for clients to implement.
  • Partitioning - using Kafka’s partitioning capabilities, the internal event source topics can be subdivided logically - selecting an appropriate partitioning key required some careful thought for some data services (especially those with interesting cross-entity validation requirements), but once partitioned, the processing logic of the application scales horizontally, as each partition can be processed without any involvement with the data in the other partitions.

There were also several additional benefits to the use of Kafka for the event sources, including:

  • As it was already a selected platform for the outbound events, there was no additional technology required for Event Source processing - the one tool was more than sufficient for both tasks - immediately reducing operational burden by avoiding different technologies for the two cases.
  • Using the same technology for Event Source processing as well as Outbound Event Delivery led to a highly composable architecture - one application’s outbound event stream became another application’s inbound event source. In conjunction with judicious use of Kafka’s log compacted topics, which act as a complete snapshot, bringing in new applications “later” was not a problem (see the example topic configuration after this list).
  • By building a suite of asynchronous services and applications all around an event sourcing and delivery data model, identifying bottlenecks became much simpler - monitoring the lag in processing the event source for any given application makes bottlenecks much clearer, allowing us to quickly direct efforts to the hotspots without delay.
  • Coordinating event processing, retries, etc. - it was possible to minimise the interaction with underlying operational databases to just the data being processed - no large transactional handling, no additional advisory (or otherwise) locking, no secondary “messaging” queue tables, etc. This allowed much simpler optimisation of these datastores for the key operational nature of the services in question.
  • Processing applications could be - and several already have been - refactored transparently to process batches of events, allowing for the many efficiencies that come with batch processing (e.g. bulk operations within databases, reduced network costs, etc.). This could be done naturally with Kafka, as the client model directly supports event batches. Adding batch processing in this way gives all applications the benefits of batching without impacting client APIs (forcing clients to create batches), and also without losing low latency under “bursty” loads.
  • Separation of client data submissions from data processing allows for (temporary) disabling of the processing engines without interrupting client data requests - this allows for a far less intrusive operational model for these applications.
  • A coarse grained event sourcing model is much more amenable to a heterogeneous technology ecosystem - using “the right tool for the job” - for example, PostgreSQL for operational datastores, Solr/ElasticSearch for search/exploratory accesses, S3/DynamoDB for additional event archival/snapshotting, etc. - all primed from the single eventing platform.
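To illustrate the partitioning and log compaction mentioned above, a topic for one of these event streams could be created roughly like this. The topic name, partition and replica counts, and the ZooKeeper address are placeholders, and newer Kafka versions take --bootstrap-server instead of --zookeeper:

  # create a partitioned, replicated, log-compacted topic for an outbound event stream
  kafka-topics.sh --create \
    --zookeeper zookeeper:2181 \
    --topic article-events \
    --partitions 16 \
    --replication-factor 3 \
    --config cleanup.policy=compact   # compaction keeps the latest event per key as a snapshot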

Today, and Moving Forward

Today, we have a suite of close to a dozen loosely coupled event driven services and applications - all processing data asynchronously, composed via event streams. Because these applications and services are built on a standard set of patterns, they are readily operated, enhanced and further developed by anyone in our larger, and still growing, team. As new requirements and opportunities come up around these applications, and the data itself, we have strong confidence in our ability to grow this system as appropriate.

If you find the topics in this post interesting, and would enjoy these types of challenges, come join us - we're hiring!

Zalando Tech Blog – Technical Articles Team Alpha

Programming is hard, and being part of an engineering team is even harder. In most organizations, cross-functional teams are not formed with an equal balance of frontend and backend engineers; their composition depends on requirements. Teams are not stable either, and people do not all have the same amount of experience. People come and go, but software stays on, so we need to buckle up and maintain it.

Retrospective

One year ago we started a new project within Zalando Category Management, which is the branch of Zalando that looks after all of our fashionable apparel and accessories. We had to implement a new system to support the reorganization of Zalando Buyers into new, more autonomous teams, to enable them to work more effectively.

When we developed a Minimum Viable Product (MVP), none of our backend developers could support or add new features to our frontend. Due to project workload, our backend developers couldn’t collaborate with our frontend developers, nor did they have any visibility into their progress. Therefore, to address these concerns, we decided to introduce full-stack responsibility – and we failed! We failed because of several factors:

  • The frontend stack was too sophisticated for the tasks we had to complete (Angular 2 Beta + Angular CLI + ngrx store);
  • User stories were not feature-focused, but instead role-focused (separate backend and frontend stories);
  • It was hard to dive into frontend development on a daily basis.

Once again, we face the issue that some frontend engineers switch teams or roles, while the original team remains responsible for all the products that have been developed. We have since decided to become responsible end-to-end as a team, independently of individual team members or engineering roles.

What has changed since?

We learned from our previous experience that we have to decide on the tools we use as a team, and share knowledge early and often. This is why we took a two-week sprint to evaluate two popular frameworks (Angular and React), which allowed us to make an informed decision this time around.

We also challenged our Product Specialist to provide us with feature-oriented user stories so we can break them down into smaller subtasks containing both frontend and backend parts. This allows us to have truly full-stack user stories, which leads to us working together and sharing knowledge. All in all, this leads to a better product.

Finally, we introduced a “health check” in our sprint planning to track whether we still work as one team. Every two weeks during sprint planning we ask ourselves: “Are we still one team?” We check our backlog and ask whether the whole team is satisfied with the scope of the next sprint. Then, based on our criteria, we determine the status of the health check and see whether any immediate action is needed or whether we are progressing towards our goal. It reminds us of the issues we have as a team and keeps our commitment to solving them high.

It’s getting personal

When taking on the task of introducing end-to-end responsibility, we surveyed the whole team and looked for answers to a specific question:

What is the single most important thing YOU wish to take care of to make our full-stack initiative a success, and why is it so important?

Check out some of our answers below. Do you agree?

"That no one is afraid of changing code anywhere in our stack. Which also means we don't have single points of failure."

"Having good documentation about 'Where to start?' and 'What architecture, tools?' are we using. I think most of the time developers of one domain are just overwhelmed with where to start when you want to write code, do a bug fix or add a small feature. For example, if you want to contribute to a Play-Scala project as a frontend developer you don't know where to change things, what the structure of the project is, which things you have to keep in mind if you do an API change etc. It is the same when you ask a Java backend developer to add a new component to an AngularJS application. I think what could help the most is something that good open source projects are doing:

  • Provide a great README as an overview to the project
  • Provide Checklists and Guidelines for Contributors to describe shortly what a user would need to do if he/she wants to add a new component, a new API endpoint etc."

"Understanding that cross-functional teams are equally responsible members for each part of their system. While there might be only frontend expertise or backend expertise in the team, from the responsibility aspect it doesn't have any impact. Decisions, discussions and changes should be discussed independently from the roles of a frontend developer or backend developer. Increase the expertise of backend developers in the frontend and vice versa to make them more impactful in discussions. They would feel more responsible and feel a stronger ownership if they could bring up valuable arguments in the discussions. In collaboration with product, we should send at least one backend developer also to product-related discussions to avoid knowledge silos."

“To make sure that people with different backgrounds actually work together and practice pair programming. In my opinion, this is crucial to succeed and also to understand other ways of working.”

We’re just starting on this full stack journey. If you’re interested in how we progress, follow us to know more! The official Zalando Tech Twitter account is here.