Zalando Tech Blog – Technical Articles Christoph Luetke Schelhowe

As Zalando continues taking steps towards becoming a fully-fledged platform, we want to move fast, validate the ways that our big strategic moves pay off, and capture the full value of our products by continuous optimization. To this end, we wanted to ensure that we’re bringing data-informed decision making to the forefront of our processes by establishing a true data and experimentation culture that could ultimately become a competitive advantage in today’s fast-changing world.

Zalando has always been a data-driven company and analytics has been one of our key success factors. We believe that much of the success (or failure) of a product rides on data, and on how it is used. This brought about the following question: How can we elevate Zalando to the next level of data-informed decision making? This is how the Product Analytics department came to life.

Purpose

The purpose of the Product Analytics team is to embed a true data and experimentation culture at Zalando to empower smart decision making.

What do we mean by true data and experimentation culture?

  • Our Business Units are aligned around key metrics that are rooted in our most important business priorities. Success is defined by a set of well-proven metrics which individual teams own and contribute to.
  • Every team can access the data they need from various data sources, with high data quality. Setting up tracking is easy, as is assessing data quality. Understanding user behavior based on A/B tests is quick, and teams are always running multiple experiments at the same time.
  • Every team can draw the right insights from their data. Teams have the ability and skills to learn from and make decisions informed by data. Advanced analytics helps them discover problems and opportunities, plus focus on the right developments.
  • Decision making is not influenced by compromises, personal biases, or egos, but only by insights.

How can we get there?

To make data-informed decision making an easy and effective routine, and to establish a data and experimentation culture, we focus on 1) building a self-service infrastructure for experimentation, tracking, and analytics, 2) ensuring common data governance, and 3) enabling and educating all teams throughout Zalando.

  • Self-service infrastructure for tracking, experimentation, and analytics: Data analysis and experimentation should be fast and easy. Only true self-service tools are truly scalable given the size of our organization today.
  • Common data governance: With nearly 200 teams producing and consuming data events, there’s a growing need to ensure event tracking completeness and correctness and to allow for the easy compatibility of data.
  • Enablement and education: As we want to move fast, all teams must be enabled and empowered in data-informed product development; e.g. from building a rationale around new features up to iterative testing and optimization at the end of the product lifecycle. We expect a certain data and experimentation affinity from everybody and want to embed a data-driven culture everywhere. In order to get there, we want to guide teams and help them be more rigorous by embedding an expert analyst role into teams.

Department structure and competencies

The Product Analytics department was created as a hybrid organization of central teams and team-embedded product analysts. The central teams provide world-class tools and knowledge in the domains of Economics, Tracking, Experimentation, Journey Modelling, and Digital & Process Analytics. Product Analysts are also embedded into teams across our Fashion Store, Data, and Logistics areas to focus on insight-driven product development. They play an instrumental part in all steps of the product lifecycle (“discover - define - design - deliver”) and can support insights-based decision making by performing the following tasks:

  • Understand user and customer behavior: Develop in-depth analytical understanding for what drives growth for the product and how it can be improved, thus inspiring product work.
  • Measure and monitor product progress: Analysts help to define target KPIs for the team and ensure that Product Specialists and Product Owners develop ownership of them. At the same time, they facilitate access to the key target KPIs and other relevant data. They establish methods to monitor short-term progress and long-term product health. When KPIs change, embedded analysts explore the underlying reasons and are able to provide context for these changes.
  • Prove if product ideas work: In the context of value creation, especially for new features, embedded analysts play an essential role by gathering and formulating analytical evidence that supports all phases in the product lifecycle, from discovery to rollout. Data must justify why we do what we do.
  • Drive product optimization: From a value capturing point of view, embedded analysts drive optimization iterations for existing features until they reach a local maximum.
  • Ensure data quality: Product Analysts create awareness about data quality within the teams where they are embedded. They have the responsibility of defining the specifications of the data to be generated by their teams, monitoring its quality and making sure the team addresses any quality-related issues they are responsible for.
  • Improve data literacy: Analysts drive the data mindset in their teams, educate and guide in terms of analytical methodology – they are enablers for any data leading to product decisions.

What the future holds

Ultimately, we want the magic of data-informed product development to happen in every team, guided by team-embedded Product Analysts and empowered by central teams with best in class self-service tools and methodologies. By adopting processes that ensure data-informed decision making is taking place, our teams can build better products and iterate faster than ever.

Opinions are great to start a discussion, but we win on insights from user behavior. We prove strong hypotheses with relentless and granular attention to data and KPIs driving our decisions. We believe in high frequency experimentation and iterations to create the best possible experience for customers and all other players in the ecosystem.

It’s our vision that every product decision – be it the discovery or rollout of a new product; be it on the customer-facing, brand, core platform or intermediary side – is backed by analytical insights and rigorous impact testing. Thereby, we’re building a solid foundation for the next big learning curve in analytics: Artificial Intelligence and Machine Learning. We’ll be revealing more about our plans and learnings in upcoming articles.

Interested in Product Analytics possibilities at Zalando? We’re hiring.


Zalando Tech Blog – Technical Articles Jan Brennenstuhl

JSON Web Tokens, or just JWTs (pron. [ˈdʒɒts]), are the fancy new kids on the block when it comes to transporting proofs of identity within an untrusted environment like the Web. In this article, I will describe the true purpose of JWTs, compare classical, stateful authentication with modern, stateless authentication, and explain why it is important to understand the fundamental difference between the two approaches.

While there are many good articles available that describe specific aspects, best practices, or single use-cases of JWTs, the bigger picture is often missing. The actual problem that the JWT specs try to solve is just not part of most discussions. With JWTs gaining in popularity, however, that missing knowledge of the fundamental ideas behind JSON Web Tokens leads to serious questions like:

  • How do I invalidate or revoke a JWT?
  • How do I prolong the expiration date of an issued JWT?
  • Do I have to use JWTs at all?

This article is not about these symptoms, but about the purpose of JWT, which actually is: getting rid of stateful authentication!

Stateful Authentication

In the old days of the Web, authentication was a pure stateful affair. With a centralized overlord entity being responsible for tokens, the world was fairly simple:

  • Tokens are issued and stored in a single service for future checking and revocation,
  • Clients and resource servers know a single point of truth for token verification and information gathering.

This worked rather well in a world of integrated systems (some might call them legacy app, mothership or simply Jimmy), when servers rendered frontends and dependencies existed on e.g. package-level and not between independently deployed applications.

In a world where applications are composed by a flock of autonomous microservices however, this stateful authentication approach comes with a couple of serious drawbacks:

  • Basically no service can operate without having a synchronous dependency towards the central token store,
  • The token overlord becomes an infrastructural bottleneck and single point of failure.

Ultimately, both facts oppose the fundamental ideas of microservice architectures. Stateful authentication not only introduces another dependency for all your single-purpose services (network latency!) but also makes them rely heavily on it. If the token overlord is unavailable, even for just a couple of seconds, everything is doomed. This is why a different approach is required: Stateless Authentication!

Stateless Authentication

Stateless authentication describes a system/process that enables its components to decentrally verify and introspect tokens. This ability to delegate token verification allows us to (partly) get rid of the direct coupling to a central token overlord and in that way enables state transfer for authentication. Having worked in stateless authentication environments for several years, the benefits in my eyes are clearly:

  • Less latency through local, decentralized token verification,
  • Custom authorization fallbacks due to local token interpretation,
  • Increased resilience, since the remote verification round-trip is removed.

Also, stateless authentication removes the need to keep track of issued tokens, and for that reason removes state (and hence storage) dependencies from your system.

The antiquated, heavyweight token overlord shrinks to yet another microservice that is mainly responsible for issuing tokens. All of this comes in handy, especially when your world mainly consists of single-page applications or mobile clients and services that primarily communicate using RESTful APIs.

“Using a JWT as a bearer for authorization, you can statelessly verify if the user is authenticated by simply checking if the expiration in the payload hasn’t expired and if the signature is valid.” —Jonatan Nilsson

One popular way to achieve stateless authentication is defined in RFC 7523 and leverages the OAuth 2.0 Authorization Framework (RFC 6749) by combining it with server-signed JSON Web Tokens (RFC 7519, RFC 7515). Instead of storing the token-to-principal relationship in a stateful manner, signed JWTs allow decentralized clients to securely store and validate access tokens without calling a central system for every request.
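To make this tangible, here is a minimal sketch of such decentralized verification on the JVM, using only JDK classes. The class and method names are ours, and the naive claim handling is for illustration only; a real service would delegate this to a vetted JWT library and validate more than just the signature and expiry.

import java.nio.charset.StandardCharsets;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: verify an RS256-signed JWT locally, without calling a central token store.
public class StatelessJwtVerifier {

    public static boolean isValid(String jwt, PublicKey issuerPublicKey) throws Exception {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) {
            return false; // not a signed JWT of the form header.payload.signature
        }
        byte[] signedContent = (parts[0] + "." + parts[1]).getBytes(StandardCharsets.US_ASCII);
        byte[] signature = Base64.getUrlDecoder().decode(parts[2]);

        // 1. Verify the signature with the issuer's public key -- no remote call needed.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(issuerPublicKey);
        verifier.update(signedContent);
        if (!verifier.verify(signature)) {
            return false;
        }

        // 2. Check the "exp" claim of the now-trusted payload (naive JSON handling for brevity).
        String payload = new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        Matcher exp = Pattern.compile("\"exp\"\\s*:\\s*(\\d+)").matcher(payload);
        return exp.find() && Long.parseLong(exp.group(1)) > System.currentTimeMillis() / 1000L;
    }
}

Note that the issuer's public key is all a verifying service needs to hold locally, which is exactly what removes the synchronous dependency on a central token store.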

With tokens not being opaque but locally introspectable, clients can also retrieve additional information (if present) about the corresponding identity directly from the token, without needing to call another remote API.

Stateful vs. Stateless

Nowadays, in a Web that is mainly characterized by a widespread transition from monolithic legacy apps to decoupled microservices, a centralized token overlord service can be described as an additional burden. The purpose of JWT is to obviate the need for such a centralized approach.

However, there again is no silver bullet and JWTs aren’t Swiss Army knives. Stateful authentication has its rightful place. If you really need a central authentication system (e.g. to fulfil restrictive auditing requirements) or if you simply don’t trust people or libraries to correctly verify your JWTs, a stateful overlord approach is still the way to go and there is nothing wrong with it.

In my opinion, you probably shouldn’t mix both approaches. To briefly answer the questions above:

  • There is no way of invalidating/revoking a JWT (and I don’t see the point), except if you just use it as yet another random string within a stateful authenticating system.
  • There is no way of altering an issued JWT, so prolonging its expiration date is likewise not possible.
  • You could use JWTs if they really help you in solving your issues. You don’t have to use them. You can also keep your opaque tokens.

If you have further comments regarding the purpose of JWT or if you think I missed something important, do not hesitate to drop me a message via Twitter. I also appreciate feedback and further discussion. Thanks!


Zalando Tech Blog – Technical Articles Hunter Kelly

To be able to measure the quality of some of the machine learning models that we have at Zalando, “Golden Standard” corpora are required.  However, creating a “Golden Standard” corpus is often laborious, tedious and time-consuming.  Thus, a method is needed to produce high quality validation corpora but without the traditional time and cost inefficiencies.

Motivation

As the Zalando Dublin Fashion Content Platform (FCP) continues to grow, we now have many different types of machine learning models.  As such, we need high quality labelled data sets that we can use to benchmark model performance and evaluate changes to the model.  Not only do we need such data sets for final validation, but going forward, we also need methods to acquire high-quality labelled data sets for training models.  This is becoming particularly clear as we start working on models for languages other than English.

Creating a “Golden Standard” corpus generally requires a human being to look at something and make some decisions.  This can be quite time consuming, and ultimately quite costly, as it is often the researcher(s) conducting the experiment that end up doing the labelling.  However, the labelling tasks themselves don't always require much prior knowledge, and could be done by anyone reasonably computer literate.  In this era of crowdsourcing platforms such as Amazon's Mechanical Turk and CrowdFlower, it makes sense to leverage these platforms to try to create these high quality data sets at a reasonable cost.

Background

Back when we first created our English language Fashion Classifier, we bootstrapped our labelled data by using the (now defunct) DMOZ, also known as the Open Directory Project.  This was a site where volunteers had been hand-categorizing websites and web pages since 1998.  A web page could live under one or more "Categories".  Using a snapshot of the site, we took any web pages/sites that had a category containing the word "fashion" anywhere in its name.  This became our “fashion” dataset.  We then also took a number of webpages and sites from categories like "News", "Sports", etc, to create our “non-fashion” dataset.

Taking these two sets of links, and with the assumption that they would be noisy, but "good enough", we generated our data sets and went about building our classifier.  And from all appearances, the data was "good enough".   We were able to build a classifier that performed well on the validation and test sets, as well as on some small, hand-crafted sanity test sets.  But now, as we circle around, creating classifiers in multiple languages and for different purposes, we want to know:

  • What is our data processing quality, assessed against real data?
  • When we train a new model, is this new model better?  In what ways is it better?
  • How accurate were our assumptions regarding "noisy but good enough"?
  • Do we need to revisit our data acquisition strategy, to reduce the noise?

And of course, the perennial question for any machine learning practitioner:

  • How can I get more data??!?

Approach

Given that Zalando already had a trial account with CrowdFlower, it was the natural choice of crowdsourcing platform to go with.  With some help from our colleagues, we were able to get set up and understand the basics of how to use the platform.

Side Note: Crowdsourcing is an adversarial system

Rather than bog down the main explanation of the approach with too many side notes, it is worth mentioning up-front that crowdsourcing should be viewed as an adversarial system.

CrowdFlower "jobs" work on the idea of "questions", and the reviewer is presented with a number of questions per page.  On each page there will be one "test question", which you must supply.  As such, the test questions are viewed as ground truth and are used to ensure that the reviewers are maintaining a high enough accuracy (configurable) on their answers.

Always remember, though, that a reviewer wants to answer as many questions as quickly as possible to maximize their earnings.  They will likely only skim the instructions, if they look at them at all.  It is important to consider accuracy thresholds and to design your jobs such that they cannot be easily gamed.  One step that we took, for example, was to put all the links through a URL shortener (see here), so that the reviewer could not simply look at the url and make a guess; they actually had to open up the page to make a decision.

Initial Experiments

We created a very simple job that contained 10 panels with a link and a dropdown, as shown below.

We had a data set of hand-picked links to use as our ground-truth test questions, approximately 90 fashion links, and 45 non-fashion links.  We then also picked some of the links we had from our DMOZ data set, and used those to run some experiments on.  Since this was solely about learning how to use the platform, we didn't agonize over this data set, we just picked 100 nominally fashion links, and 100 nominally non-fashion links, and uploaded those as the data to use for the questions.

We ran two initial experiments.  In the first, we tried to use some of the more exotic, interesting "Quality Control" settings that CrowdFlower makes available, but we found that the number of "Untrusted Judgements" was far too high compared to "Trusted Judgements".  We simply stopped the job, copied it, and launched another.

The second of the initial experiments proved quite promising: we got 200 links classified, with 3 judgements per link (so 600 trusted judgements in total).  The classifications from the reviewers matched the DMOZ labels pretty closely.  All the links where the DMOZ label and the CrowdFlower reviewers disagreed were examined; there was one borderline case that was understandable, and the rest were actually indicative of the noise we expected to see in the DMOZ labels.

Key learnings from initial experiments:
  • Interestingly, we really overpaid on the first job.  Dial down the costs until after you've run a few experiments.  If the “Contributor Satisfaction” panel on the main monitoring page has a “good” (green) rating, you’re probably paying too much.
  • Start simple.  While it is tempting to play with the advanced features right from the get-go, don't.  They can cause problems with your job running smoothly; only add them in if/when they are needed.
  • You can upload your ground truth questions directly rather than using the UI, see these CrowdFlower docs for more information.
  • You can have extra fields in the data you upload that aren't viewed by the reviewer at all; we were then able to use the CrowdFlower UI to quickly create pivot tables and compare the DMOZ labels against the generated labels.
  • You can get pretty reasonable results even with minimal instructions.
  • Design your job such that "bad apples" can't game the system.
  • It's fast!  You can get quite a few results in just an hour or two.
  • It's cheap!  You can run some initial experiments and get a feeling for what the quality is like for very little.  Even with our "massive" overspend on the first job, we still spent less than $10 total on our experiments.

Data Collection

Given the promising results from the initial experiments, we decided to proceed and collect a "Golden Standard" corpus of links, with approximately 5000 examples from each class (fashion and non-fashion).  Here is a brief overview of the data collection process:

  • Combine our original DMOZ link seed set with our current seed set
  • Use this new seed set to search the most recent CommonCrawl index to generate candidate links
  • Filter out any links that had been used in the training or evaluation of our existing classifiers
  • Sample approximately 10k links from each class: we intentionally sampled more than the target number to account for inevitable loss
  • Run the sampled links through a URL shortener to anonymize the urls
  • Prepare the data for upload to CrowdFlower (a rough sketch of the sampling and anonymization steps is shown below)
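For illustration only, the sampling and anonymization steps could be sketched as follows. The class and field names are ours; in the actual run an external URL shortener was used rather than the hash-based stand-in shown here, and the hidden nominal label column is what later allowed us to pivot crowd answers against the DMOZ labels.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch: sample ~N candidate links per nominal class, and replace each raw URL
// with an opaque id so reviewers cannot guess the label from the URL itself.
public class CorpusSampler {

    public static List<String[]> prepareRows(Map<String, List<String>> candidatesByClass,
                                             int perClass, long seed) {
        Random random = new Random(seed);
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : candidatesByClass.entrySet()) {
            List<String> links = new ArrayList<>(entry.getValue());
            Collections.shuffle(links, random); // random sample per class
            for (String url : links.subList(0, Math.min(perClass, links.size()))) {
                // Stand-in for a shortened link; a real run would call a URL shortener service.
                String opaqueId = Integer.toHexString(url.hashCode());
                rows.add(new String[] {opaqueId, url, entry.getKey()}); // id, original url, nominal label
            }
        }
        return rows;
    }
}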

Final Runs

With data in hand, we wanted to make some final tweaks to the job before running it.  We fleshed out the instructions (not shown) with examples and more thorough definitions, even though we realized they would not be read by many.  We upped the minimum accuracy from 70% to 85% (as suggested by CrowdFlower).  Finally, we adjusted the text in the actual panels to explain what to do in borderline or error cases.

We ran a final experiment against the same 200 links as in the previous experiments.  The results were very similar, if not marginally better than the previous experiment, so we felt confident that the changes hadn't made anything worse.  We then incorporated the classified links as new ground truth test questions (where appropriate) into the final job.

We launched the job, asking for 15k links from a pool of roughly 20k.  Why 15k?  We wanted 5k links from each class; we were estimating about 20% noise on the DMOZ labels.  We also wanted a high level of agreement, so links that had 3/3 reviewers agreeing.  From the previous experiments, we were getting unanimous agreement on about 80% of the links seen.  So 10k + noise + agreement + fudge factor + human predilection for nice round numbers = 15k.

We launched the job in the afternoon; it completed overnight and the results were ready for analysis the next morning, which leads to...

Evaluation

How does the DMOZ data compare to the CrowdFlower data?  How good was "good enough"?

We can see two things, right away:

1. The things in DMOZ that we assumed were mostly not fashion, were, in fact, mostly not fashion.  1.5% noise is pretty acceptable.

2. Roughly 22% of all our DMOZ "Fashion" links are not fashion.  This is pretty noisy, and indicates that it was worth all the effort of building this properly labelled "Golden Standard" corpus in the first place!  There is definitely room for improvement in our data acquisition strategy.

Now, those percentages change if we only take into account the links where all the reviewers were in agreement; the noise in the fashion set drops down to 15%.  That's still pretty noisy.

So what did we end up with, for use in the final classifier evaluations?  Note that the total numbers don't add up to 15k because we simply skipped links that produced errors on fetching, 404s, etc.

This shows us that, similar to the initial experiments, we had unanimous agreement roughly 80% of the time.
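As a rough illustration of how such numbers can be derived (the names and the assumption of three judgements per link are ours, not CrowdFlower's export format), the per-link aggregation boils down to something like this:

import java.util.List;
import java.util.Map;

// Illustrative aggregation of crowd judgements: for each link, count how many reviewers said
// "fashion", derive a majority label, and record whether the vote was unanimous and whether
// the majority disagrees with the nominal DMOZ label.
public class AgreementStats {

    public static void summarize(Map<String, List<Boolean>> judgementsByLink,
                                 Map<String, Boolean> dmozIsFashion) {
        int unanimous = 0, disagreesWithDmoz = 0;
        for (Map.Entry<String, List<Boolean>> e : judgementsByLink.entrySet()) {
            List<Boolean> votes = e.getValue();
            long fashionVotes = votes.stream().filter(v -> v).count();
            boolean majorityFashion = fashionVotes * 2 > votes.size();
            if (fashionVotes == 0 || fashionVotes == votes.size()) {
                unanimous++;
            }
            if (dmozIsFashion.containsKey(e.getKey())
                    && dmozIsFashion.get(e.getKey()) != majorityFashion) {
                disagreesWithDmoz++;
            }
        }
        int total = judgementsByLink.size();
        System.out.printf("unanimous: %.1f%%, disagreement with DMOZ: %.1f%%%n",
                100.0 * unanimous / total, 100.0 * disagreesWithDmoz / total);
    }
}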

Aside: It's interesting to note that both the DMOZ noise and the number of links where opinions were split work out to about 20%.  Does this point to some deeper truth about human contentiousness?  Who knows!

So what should we use to do our final evaluation?  It's tempting to use the clean set of data, where everyone is in agreement.  But on the other hand, we don't want to unintentionally add bias to our classifiers by only evaluating it on clean data.  So why not both?  Below are the results of running our old baseline classifier, as well as our new slimmer classifier, against both the "Unanimous" and "All" data sets.

Taking a look at our seeds and comparing that to the returned links, we find that 4,023 of the 15,000 are links in the seed set, with the following breakdown when we compare against nominal DMOZ labels:

Key Takeaways

  • Overall, the assumption that the DMOZ was "good enough" for our initial data acquisition was pretty valid.  It allowed us to move our project forward without a lot of time agonizing over labelled data.
  • The DMOZ data was quite noisy, however, and could lead to misunderstandings about the actual quality of our models if used as a "Golden Standard".
  • Crowdsourcing, and CrowdFlower, in particular, can be a viable way to accrue labelled data quickly and for a reasonable price.
  • We now have a "Golden Standard" corpus for our English Fashion Classifier against which we can measure changes.
  • We now have a methodology for creating not only "Golden Standard" corpora for measuring our current data processing quality, but a method that can be extended to create larger data sets that can be used for training and validation.
  • There may be room to improve the quality of our classifier by using a different type of classifier that is more robust in the face of noise in the training data (since we've established that our original training data was quite noisy).
  • There may be room to improve the quality of the classifier by creating a less noisy training and validation set.

Conclusion

Machine Learning can be a great toolkit to use to solve tricky problems, but the quality of data is paramount, not just for training but also for evaluation.  Not only here in Dublin, but all across Zalando, we’re beginning to reap the benefits of affordable, high quality datasets that can be used for training and evaluation.  We’ve just scratched the surface, and we’re looking forward to seeing what’s next in the pipeline.

If you're interested in the intersection of microservices, stream data processing and machine learning, we're hiring.  Questions or comments?  You can find me on Twitter at @retnuH.


Zalando Tech Blog – Technical Articles Hung Chang

While developing Zalando’s real-time business process monitoring solution, we encountered the need to generate complex events upon the detection of specific patterns of input events. In this blog post we describe the generation of such events using Apache Flink, and share our experiences and lessons learned in the process. You can read more on why we have chosen Apache Flink over other stream processing frameworks here: Apache Showdown: Flink vs. Spark.

This post is aimed at those who are familiar with stream processing in general and have had some first experience working with Flink. We recommend Tyler Akidau’s blog post The World Beyond Batch: Streaming 101 to understand the basics of stream processing, and Fabian Hueske’s Introducing Stream Windows in Apache Flink for the specifics of Flink.

Business Processes

To start off, we would like to offer more context on the problem domain. Let’s begin by having a look at the business processes monitored by our solution.

A business process is, in its simplest form, a chain of correlated events. It has a start and a completion event. See the example depicted below:

The start event of the example business process is ORDER_CREATED. This event is generated inside Zalando’s platform whenever a customer places an order. It could have the following simplified JSON representation:

{
  "event_type": "ORDER_CREATED",
  "event_id": 1,
  "occurred_at": "2017-04-18T20:00:00.000Z",
  "order_number": 123
}

The completion event is ALL_PARCELS_SHIPPED. It means that all parcels pertaining to an order have been handed over for shipment to the logistic provider. The JSON representation is therefore:

{
  "event_type": "ALL_PARCELS_SHIPPED",
  "event_id": 11,
  "occurred_at": "2017-04-19T08:00:00.000Z",
  "order_number": 123
}

Notice that the events are correlated on order_number, and also that they occur in order according to their occurred_at values.

So we can monitor the time interval between these two events, ORDER_CREATED and ALL_PARCELS_SHIPPED. If we specify a threshold, e.g. 7 days, we can tell for which orders the threshold has been exceeded and then can take action to ensure that the parcels are shipped immediately, thus keeping our customers satisfied.

Problem Statement

A complex event is an event which is inferred from a pattern of other events.

For our example business process, we want to infer the event ALL_PARCELS_SHIPPED from a pattern of PARCEL_SHIPPED events, i.e. generate ALL_PARCELS_SHIPPED when all distinct PARCEL_SHIPPED events pertaining to an order have been received within 7 days. If the received set of PARCEL_SHIPPED events is incomplete after 7 days, we generate the alert event THRESHOLD_EXCEEDED.

We assume that we know beforehand how many parcels we will ship for a specific order, thus allowing us to determine if a set of PARCEL_SHIPPED events is complete. This information is contained in the ORDER_CREATED event in the form of an additional attribute, e.g. "parcels_to_ship":  3.

Furthermore, we assume that the events are emitted in order, i.e. the occurred_at timestamp of ORDER_CREATED is smaller than the occurred_at timestamps of all PARCEL_SHIPPED events.

Additionally we require the complex event ALL_PARCELS_SHIPPED to have the timestamp of the last PARCEL_SHIPPED event.

The raw specification can be represented through the following flowchart:

We process all events from separate Apache Kafka topics using Apache Flink. For a more detailed look of our architecture for business process monitoring, please have a look here.

Generating Complex Events

We now have all the required prerequisites to solve the problem at hand, which is to generate the complex events ALL_PARCELS_SHIPPED and THRESHOLD_EXCEEDED.

First, let’s have an overview on our Flink job’s implementation:

  1. Read the Kafka topics ORDER_CREATED and PARCEL_SHIPPED.
  2. Assign watermarks for event time processing.
  3. Group together all events belonging to the same order, by keying by the correlation attribute, i.e. order_number.
  4. Assign TumblingEventTimeWindows to each unique order_number key with a custom time trigger.
  5. Order the events inside the window upon trigger firing. The trigger checks whether the watermark has passed the biggest timestamp in the window. This ensures that the window has collected enough elements to order.
  6. Assign a second TumblingEventTimeWindow of 7 days with a custom count and time trigger.
  7. Fire by count and generate ALL_PARCELS_SHIPPED or fire by time and generate THRESHOLD_EXCEEDED. The count is determined by the "parcels_to_ship" attribute of the ORDER_CREATED event present in the same window.
  8. Split the stream containing events ALL_PARCELS_SHIPPED and THRESHOLD_EXCEEDED into two separate streams and write those into distinct Kafka topics.

The simplified code snippet is as follows:

// 1
List<String> topicList = new ArrayList<>();
topicList.add("ORDER_CREATED");
topicList.add("PARCEL_SHIPPED");
DataStream<JSONObject> streams = env.addSource(
      new FlinkKafkaConsumer09<>(topicList, new SimpleStringSchema(), properties))
      .flatMap(new JSONMap()); // parse Strings to JSON
// 2-5
DataStream<JSONObject> orderingWindowStreamsByKey = streams
      .assignTimestampsAndWatermarks(new EventsWatermark(topicList.size()))
      .keyBy(new JSONKey("order_number"))
      .window(TumblingEventTimeWindows.of(Time.days(7)))
      .trigger(new OrderingTrigger<>())
      .apply(new CEGWindowFunction<>());
// 6-7
DataStream<JSONObject> enrichedCEGStreams = orderingWindowStreamsByKey
     .keyBy(new JSONKey("order_number"))
     .window(TumblingEventTimeWindows.of(Time.days(7)))
     .trigger(new CountEventTimeTrigger<>())
     .reduce((ReduceFunction<JSONObject>) (v1, v2) -> v2); // always return last element
// 8
enrichedCEGStreams
      .flatMap(new FilterAllParcelsShipped<>())
      .addSink(new FlinkKafkaProducer09<>(Config.allParcelsShippedType, 
         new SimpleStringSchema(), properties)).name("sink_all_parcels_shipped");
enrichedCEGStreams
      .flatMap(new FilterThresholdExceeded<>())
      .addSink(new FlinkKafkaProducer09<>(Config.thresholdExceededType,
         new SimpleStringSchema(), properties)).name("sink_threshold_exceeded");

Challenges and Learnings

The firing condition for CEG requires ordered events

As per our problem statement, we need the ALL_PARCELS_SHIPPED event to have the event time of the last PARCEL_SHIPPED event. The firing condition of the CountEventTimeTrigger thus requires the events in the window to be in order, so we know which PARCEL_SHIPPED event is last.

We implement the ordering in steps 2-5. As each element arrives, the keyed state stores the biggest timestamp seen so far. When the registered timer fires, the trigger checks whether the watermark has passed that biggest timestamp. If so, the window has collected enough elements for ordering. We ensure this by letting the watermark progress only at the earliest timestamp among all events. Note that ordering events is expensive in terms of the size of the window state, which keeps them in memory.
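The production trigger is more involved, but a simplified sketch of that idea against Flink's Trigger API could look as follows; state handling and timer cleanup are reduced to the essentials, and the JSONObject element type mirrors the snippet above.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.json.JSONObject; // assumed JSON type, matching the snippet above

// Sketch only: remember the largest timestamp seen so far and fire once the watermark
// has passed it, so the window function can safely order the collected events.
public class OrderingTrigger<W extends Window> extends Trigger<JSONObject, W> {

    private final ValueStateDescriptor<Long> maxTimestampDesc =
            new ValueStateDescriptor<>("maxTimestamp", Long.class);

    @Override
    public TriggerResult onElement(JSONObject element, long timestamp, W window, TriggerContext ctx)
            throws Exception {
        ValueState<Long> maxTimestamp = ctx.getPartitionedState(maxTimestampDesc);
        Long current = maxTimestamp.value();
        if (current == null || timestamp > current) {
            maxTimestamp.update(timestamp);
        }
        // Ask to be called back when event time reaches this element's timestamp.
        ctx.registerEventTimeTimer(timestamp);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception {
        Long maxTimestamp = ctx.getPartitionedState(maxTimestampDesc).value();
        // Fire (and purge) only once the watermark has passed the biggest timestamp seen,
        // i.e. no earlier elements are expected and the window contents can be ordered.
        if (maxTimestamp != null && ctx.getCurrentWatermark() >= maxTimestamp) {
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE; // purely event-time based
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(maxTimestampDesc).clear();
    }
}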

Events arrive in windows at different rates

We read our event streams from two distinct Kafka topics: ORDER_CREATED and PARCEL_SHIPPED. The former is much bigger than the latter in terms of size, and is thus read at a slower rate.

Events arrive in the window at different speeds. This impacts the implementation of the business logic, particularly the firing condition of the OrderingTrigger. It waits for both event types to reach the same timestamps by keeping the smallest seen timestamp as the watermark. The events pile up in the windows’ state until the trigger fires and purges them. Specifically, if events in the topic ORDER_CREATED start from January 3rd and the ones in PARCEL_SHIPPED start from January 1st, the latter will pile up and only be purged once Flink has processed the former up to January 3rd. This consumes a lot of memory.
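One simplified way to express “the watermark follows the slower topic” with a periodic watermark assigner is sketched below; the real EventsWatermark additionally has to cope with parallel source instances and idle topics, which this sketch ignores, and the JSON field names match the event examples above.

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.json.JSONObject; // assumed JSON type, matching the snippet above

import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch only: track the latest timestamp seen per event type and let the watermark
// advance only to the smallest of them, so the faster topic cannot race ahead.
public class EventsWatermark implements AssignerWithPeriodicWatermarks<JSONObject> {

    private final Map<String, Long> maxSeenPerType = new HashMap<>();
    private final int expectedEventTypes;

    public EventsWatermark(int expectedEventTypes) {
        this.expectedEventTypes = expectedEventTypes;
    }

    @Override
    public long extractTimestamp(JSONObject event, long previousElementTimestamp) {
        long ts = Instant.parse(event.getString("occurred_at")).toEpochMilli();
        maxSeenPerType.merge(event.getString("event_type"), ts, Math::max);
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        if (maxSeenPerType.size() < expectedEventTypes) {
            // Hold the watermark back until every topic has contributed at least one event.
            return new Watermark(Long.MIN_VALUE);
        }
        // Advance only as far as the slowest topic.
        return new Watermark(Collections.min(maxSeenPerType.values()) - 1);
    }
}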

Some generated events will be incorrect at the beginning of the computation

We cannot have an unlimited retention time in our Kafka queue due to finite resources, so events expire. When we start our Flink jobs, the computation will not take into account those expired events. Some complex events will either not be generated or will be incorrect because of the missing data. For instance, missing PARCEL_SHIPPED events will result in the generation of a THRESHOLD_EXCEEDED event, instead of an ALL_PARCELS_SHIPPED event.

Real data is big and messy. Test with sample data first

At the beginning, we used real data to test our Flink job and reason about its logic. We found its use inconvenient and inefficient for debugging the logic of our triggers. Some events were missing or their properties were incorrect. This made reasoning unnecessarily difficult for the first iterations. Soon after, we implemented a custom source function, simulated the behaviour of real events, and investigated the generated complex events.

Data is sometimes too big for reprocessing

The loss of the complex events prompts the need to generate them again by reprocessing the whole Kafka input topics, which for us hold 30 days of events. This reprocessing proved to be unfeasible for us. Because the firing condition for CEG needs ordered events, and because events are read at different rates, our memory consumption grows with the time interval of events we want to process. Events pile up in the windows’ state and await the watermark progression so that the trigger fires and purges them.

We used AWS EC2 t2.medium instances in our test cluster with 1GB of allocated RAM. We observed that we could reprocess, at most, 2 days' worth of events without TaskManager crashes due to OutOfMemory exceptions. Therefore, we implemented additional filtering on earlier events.

Conclusion

Above we have shown you how we designed and implemented the complex events ALL_PARCELS_SHIPPED and THRESHOLD_EXCEEDED. We have shown how we generate these in real-time using Flink’s event time processing capabilities. We have also presented the challenges we’ve encountered along the way and have described how we met those using Flink’s powerful event time processing features, i.e. watermark, event time windows and custom triggers.

Advanced readers will be aware of the CEP library Flink offers. When we started with our use cases (Flink 1.1), we determined that they could not be easily implemented with it. We believed that full control over the triggers gave us more flexibility when refining our patterns iteratively. In the meantime, the CEP library has matured, and the upcoming Flink 1.4 will also support dynamic state changes in CEP patterns. This will make implementations of use cases similar to ours more convenient.

If you have any questions or feedback you’d like to share, please get in touch. You can reach us via e-mail: hung.chang@zalando.de and mihail.vieru@zalando.de.


Zalando Tech Blog – Technical Articles Alaa Elhadba

Information Retrieval (IR) systems are a vital component at the core of successful modern web platforms, and Zalando understands their importance incredibly well.

The main goal of IR systems is to provide a communication layer that enables customers to establish a retrieval dialogue with underlying data.

The immense explosion of unstructured data drives modern search applications to go beyond fuzzy string matching and invest in a deep understanding of user queries, interpreting user intent in order to respond with a relevant result set.

The modern architecture of search is a design of a data-driven IR system that covers the following:

  • Data ingestion pipelines from various sources
  • Data retrieval and the lifecycle of a user search query
  • Machine-learned relevance ranking
  • Personalized search
  • Search performance tracking and quality assessment

At the recent Berlin Buzzwords conference this month, we discussed the components needed to build an ecosystem that is designed to solve the problems of IR in web platforms. What role can Machine Learning play in search relevancy? How can natural language processing help provide a solid understanding of search phrases? How can data drive a personalized search experience? And finally, what are the challenges of maintaining such a complex system?  

Watch as we reveal those answers and more below.


Zalando Tech Blog – Technical Articles Jan Mußler

A lot of time has passed at Zalando since the first services were started backed by PostgreSQL 9.0-rc1. Despite the adoption of other technologies, PostgreSQL remains the preferred relational database for most engineers around. You can follow some of the developments around PostgreSQL on the blog and also on GitHub where we share most of our PostgreSQL-related tooling.

Let’s start with a quick look at PostgreSQL on AWS. When Zalando Tech began its transition to AWS, the STUPS landscape and tooling were created. For the ACID team (the database engineering team), the most relevant changes were that applications had to run in Docker and that EC2 instances might be slightly less reliable than what we were used to.

At scale and in the cloud, automation is key. The ACID team started the work on Patroni, today Zalando’s most popular open source GitHub project, to take care of PostgreSQL deployments and manage high availability, among other valuable features. The next step was Spilo, packaging Patroni and PostgreSQL into a single Docker image and providing guidance on how to deploy database clusters using AWS CloudFormation templates.

Today, teams have the choice of deploying PostgreSQL either with AWS RDS or with Spilo. We are convinced that Spilo is the more flexible solution, giving teams more control, although the one-click RDS service is often more compelling; that extra control and flexibility is not always required.

However automated our deployment became, we had not focused on the last step: automating the initial request for a cluster. Somewhere between a team wanting a PostgreSQL database and the database team creating it, there was still a ticketing system. This had to change. Initial work on a REST service to trigger Spilo deployments on plain AWS/EC2 was scrubbed in favor of a new solution using Kubernetes, believing that this is the future platform to run on and benefiting from its feature set: a stable API and declarative deployment descriptions. Kubernetes today runs on various cloud providers, opening the solution up to a bigger target audience with less lock-in.

Current status

Let’s take a look at what we are currently developing and working on as open source products. First, we will briefly touch on the PostgreSQL operator and its tiny user interface and then look into the pg_view web version.

Kubernetes provides so-called third party objects, allowing us to store YAML documents within Kubernetes itself and act upon their changes. Using those third party objects to describe PostgreSQL clusters, we started working on the operator that picks up the YAML definitions and transforms them into Kubernetes resources needed to run and expose PostgreSQL clusters to our engineers. This concept will later allow us to easily configure and provision PostgreSQL into production environments with a common deployment pipeline that relies solely on the Kubernetes API, basically triggering PostgreSQL cluster setup from engineers committing to Git.

Writing a YAML file is pretty easy, but it turned out that having a user interface to get a cluster even more quickly was a good idea and less error-prone. Thus, we wrote a very small RiotJS user interface for engineers to create PostgreSQL clusters, which also provides them with feedback on how far the cluster creation has progressed. As one basically only works against the nice Kubernetes API, this was not much work in the end.

The next thing we learned is that once you have a UI, engineers create clusters with incomplete or slightly wrong configurations, forcing us to quickly add the first features for changing a cluster's configuration and to test the idea of the operator in production. Making such a change means updating the third party object and letting the postgres-operator update the Kubernetes resources.

Thanks to the work done in Patroni, which tackles deployment, configuration, failover and recovery from S3, for example, the deployment of a database is only a part of what users expect from PostgreSQL as a service. Maintenance and monitoring are equally essential and most likely to require even more work and attention.

Monitoring

Earlier, we released our console tool pg_view to monitor PostgreSQL clusters in “real-time”; however, by its nature it required users to have SSH access to the machine, something that is no longer possible and not desired for every engineer. Discussing the options, not everyone was immediately on board with the idea of moving this to a web-based solution, but one of our engineers had already done the heavy lifting: a custom PostgreSQL extension was lingering around in his GitHub repos, providing all the metric and query data via a single HTTP endpoint. We quickly implemented a tiny prototype UI showing the same data previously visible on the terminal and, as we received good feedback on the idea, decided to stick with it.

While this provided the critical insights into running queries and system metrics, we also reworked the ZMON-based coverage of our PostgreSQL clusters. ZMON checks track the basic metrics one expects from AWS instances/Kubernetes Pods: CPU and memory, along with storage metrics and monitoring for free disk space. Additionally, we also started to track PostgreSQL-internal metrics from tables and indexes to give engineers a better impression of how tables and indexes are growing, as well as how and where sequential scan or index scan patterns change over time.

What’s in it for you?

We have already open sourced the operator and are investing more time to improve its feature set as we speak. Shortly, we will also release the user interface for “creating” the third party resources to trigger PostgreSQL clusters. Our pg_view web version will also arrive soon.

From our point of view, the above creates a very useful set of projects around operating PostgreSQL on Kubernetes. Keep an eye out for new repositories in the Zalando Incubator, or contact me via Twitter at @JanMussler if you have further questions. Interested in joining us? We're hiring.
