Zalando Tech Blog – Technical Articles Paulo Renato Campos de Siqueira

Pull Requests (PRs) are the norm in team software development today. They are the right way to submit code changes so that your peers can check them out, add their thoughts, and help you create the best code you can - i.e. PRs let us easily introduce code review into our development process and enable a great deal of teamwork, while also decreasing the number of bugs our software contains.

There are several aspects we can talk about when it comes to Pull Requests and code review. In this post, I'm specifically concerned with the size of PRs, although I'll briefly touch on other points as well. Other dimensions you could think about include having a good description of what is being done and why, and making sure that the Pull Request changes one thing and one thing only, i.e. that it is independent and self-contained.

On a personal note, I think Pull Requests are so important nowadays that I even use them on projects where I work alone, so that I can have automated checks applied before deciding to merge into master. It gives me that extra opportunity to catch errors before it is too late. In GitHub, for example, this generates a nice visual summary of the checks performed. And yes, you could run these checks straight on your branches, but using PRs is easier and more organized. You can, for example, easily decline and close a PR, and document why you did it. In this PR in one of my pet projects, for example, you can see Codacy, Travis CI and CodeCov checking my code before I merge it to master.

Having said the above, it is way too easy to get carried away when developing, and you may end up adding several small things at once - be it features, fixes, or simply some refactoring in the same PR - thus making it quite large and hard to read. And don't get me wrong: crafting small, self-contained and useful Pull Requests is not easy! Good developers don't create big PRs because they are lazy: sometimes it is hard to see the value in going the extra mile to break up what has already become way too big.

Another aspect to consider here relates to good git commit practices in general. Having small Pull Requests also helps keep individual commits small and focused, which is very valuable when maintaining code. Let's illustrate this point with an example that happened to me recently.

Can I revert this?

I was investigating a bug, something that used to work well and that simply stopped working out of the blue. After some time and investigation, I found that the relevant code was simply removed, and that we didn't notice it beforehand because of yet another bug. Obvious solution: go through the git history and just git revert the deleted code. Except that I couldn't find any commit related to it.

After further investigation I finally found the commit that removed the files - but it was a commit that also did several other unrelated things. git revert was no longer an option, especially because of the rest of the code that had changed since then, and I ended up having to re-add the files manually. The total time spent ended up being far more than it needed to be.

Why are big Pull Requests a problem?

The first and most important thing to note here is the limited human capacity to hold knowledge in one's head. There is a limit to how much information you can keep and correlate at once while also weighing all its consequences for the rest of the system, or even for external / client systems. This limit will obviously differ from person to person, but it is a problem at some level for everyone. And when working in a team, you have to lower the bar to make sure everyone can work at the same level.

When you are reviewing a Pull Request, you have to keep some things in mind, such as:

  • What are the new components being created?
  • How do they interact with existing components?
  • Is there code being deleted? If so, should it really be deleted?
  • Are the new components really necessary? Perhaps you already have something in the current code base that solves the problem? Or something that could be generalized and applied to both places?
  • Do you see new bugs being introduced?
  • Is the general design OK? Is it consistent with the rest of the project's design?

There are quite a few things to check. How much of that can you keep in your mind while you are reviewing code? This gets harder the more lines of code there are to be checked.

So back to small PRs. While all of this has little to no impact on automated checks and builds, it can have a huge impact when it comes to code review. With that in mind, let's go through at least a few ideas you can use to escape the kind of situation where you don't really feel like breaking your PR into smaller pieces - but should nonetheless. There is no black magic here; we will just use some nice git commands in a way that helps us achieve our goal. Perhaps you will even discover a few things you didn't know before!

Sort your imports

I prefer sorting imports in alphabetical order, but the actual criterion doesn't matter, as long as the whole team uses the same one. This practice can easily be automated and avoids generating a diff when two developers add the same import in different positions. It also completely eliminates duplicated imports generated by merges.

Sometimes this will also avoid conflicts where two developers remove or add unrelated imports in the same position and git doesn't know what to do about them. Sorting imports makes them naturally mergeable.

Avoid frequent formatting changes

This happens a lot, especially if you don't use code formatting tools like scalafmt or scalariform (or whatever is available for your language of choice). Sometimes you may see a blank line you don't like. Or you don't see a blank line where you believe one should be. You simply go ahead and delete or add it. This means yet another line change that goes into your PR.

This is not only about PR size. Such a small change has a big chance of creating conflicts if you ever have to update your PR before merging. Another developer might legitimately change the same spot in the code, and you now have to check very carefully whether a change was only cosmetic and can thus be ignored, or whether there was something real to consider. More than once I've seen features simply vanish because of this kind of thing.

If you really want to make some formatting changes, do so, but send it as a separate PR that can be merged as soon as possible, and independent of any features. And consider automating this task as well.

Allow reviewers the time to review

This is a little meta, but important nonetheless: resist the urge to want your code merged right away. I suffer from this myself from time to time, especially when we have some very small PRs. Still, the reviewers should be allowed time to work. If you did a good job of making it small and self-contained, and added a good description to the PR body, you will likely get some speedy feedback.

To better explain this it is worth quoting a teammate, who once said: “Sometimes it feels like we are asking for thumbs, not for reviews.”

If you sense something like this is happening, you should stop. You are probably rushing the review process, which will only result in some stress and badly reviewed code. My rule of thumb is to not ask for a thumbs up, quite literally. Every time I catch myself doing so I stop and rephrase, asking for a review instead.

Advanced and powerful: manipulating your sources with Git

Now for the more complex (and perhaps more interesting) practices. What follows requires at least an intermediate understanding of git, and a prerequisite of not being afraid of git rebase. As a side note, I say this because most of us are afraid of git rebase when we first begin learning it. This is only temporary though, until you fully realize the power it gives you.

Let's now think about the following scenario. You are working on a feature, and suddenly notice that some kind of side change is required. Something not strictly related to the feature itself, but that would be of great help for your task. You might then get the urge to simply go ahead and do it, together with your current feature code.

Side changes with Git Stash

See the problem already? If you simply do it, the PR for your feature will get bigger. It will also now contain one (or more) extra concerns, meaning that the reviewers would have to verify this as well.

Instead, you should send this side change as a new PR. There are a few different ways to do this properly with git, but the easiest is to use git stash. What this does is hide your current changes and leave you with a clean workspace. You can then switch to a new branch, implement the side changes, and submit the PR.
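
For illustration, the whole detour might look like this (the branch name is just a placeholder):

git stash                              # park the unfinished feature work
git checkout -b side-changes master    # start the side change from master
# ...implement and commit the side change...
git push origin side-changes           # and open the PR from this branch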

With that, your teammates can start reviewing these changes immediately while you are still working on the feature itself. Moreover, they will also be able to leverage those changes in their own code - who said these changes would be useful only for you? And finally, it also gives your colleagues the opportunity to point out problems sooner rather than later. Perhaps something is incompatible with someone else's work, or another developer had just started to make the same kind of changes and now doesn't have to. You can work together to achieve an even better result. Not to mention that this should be a small PR, and therefore quite easy to review.

After the PR is sent, you can recover your work with git stash pop: move back to your feature branch, get your changes back, and keep working. But now there is yet another problem: how do you deal with the fact that your side changes are probably not merged yet?

First, the problem is not that big in principle. The side changes are in their own commit, and thus your main changes are completely isolated. If at any time you get feedback and have to update the PR you just sent, you can always stash your current changes again. See the git stash documentation for more information on how this works.

Second, it might be that your PR with the side changes will simply be accepted as is and merged. In this case, it is quite easy to get your feature branch up to date. A git rebase master (or whatever branch your team merges to) should do the trick. This is probably the easiest (and safest) variation of git rebase you can use. See the git rebase documentation here.

Finally, some pointers for the most complex case. You may find that you will have to fix many things on your side changes PR. Also, at this point you may have already made a few commits towards the feature you are implementing. You can use your imagination here and a nifty combination of git features to solve your problem. For example, you could try the following steps:

  • Wait for the side changes PR to be merged to master
  • Update your master: git pull
  • Create a new branch, based on master: git checkout -b my-new-branch
  • Go to your feature branch and carefully use git log to find which commits you used for the feature
  • Go back to the new branch
  • Use git cherry-pick to move over the commits you found with git log

See the git cherry-pick documentation here. Notice that you can also cherry-pick a series of commits, instead of picking them one by one, if you prefer. This also lets you build on top of the commit you already sent as a new PR, perhaps in a temporary branch where you add your feature code on top of it.
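
For example, a contiguous range of commits can be picked in one go (the hashes below are placeholders):

git checkout my-new-branch
git cherry-pick abc1234^..def5678    # picks abc1234 through def5678, inclusive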

As you can see, git is a very powerful tool and offers you many ways to solve your problems.

Splitting up code into multiple PRs

The next scenario is that moment when you’ve already gotten too excited with your code and couldn't stop, and ended up with a huge pile of changes to throw at your peers' heads. In this case, it can be very easy to simply go and say something like:

“Sorry for the big PR. I could split it into smaller pieces but it would take too long.”

Let's go through some ideas to avoid this scenario by applying a little effort and splitting up your work.

First off, if you have well-crafted, individual commits, those can easily be turned into PRs with git cherry-pick. You can simply write down which commits you want to submit as new PRs, move to a new branch, and bring those commits over with git cherry-pick. You can combine this with git stash to make it easier to deal with uncommitted code, as described above.

One small drawback is that sometimes your changes depend on each other, and you might have to wait for the first one to be merged before you can really send the second one. On the other hand, if the first commit is small, chances are it will be approved quickly, as we have already mentioned.

The whole process might not be too pleasant for you at first, but will definitely help the rest of the team. A small tip that might sound obvious is to "pre-wire" your PRs: go to your peers and let them know that those PRs are coming and what they are about. This will help them review your code faster.

A note about failure

It might all be beautiful on paper, but in reality this is not always possible. Even if you follow the tips presented here, you may still end up with big PRs from time to time. The critical point is that, when this happens, it should:

  • Be a conscious decision, not an accident;
  • Be as small as possible, i.e., you applied at least some of the tips above;
  • Be an exception, not the rule.

Remember: this is all about teamwork. Some things might make you a little slower, especially until you get into the right frame of mind, but it will make the whole team faster in the long run, and will also increase the chances of bugs being caught during code review. A final plus is that knowledge-sharing will also be better, since there is less to learn on each PR, and team members can ask more questions without being afraid of turning the review process into an endless discussion.

If you have read everything up until this point, then perhaps you are interested in reading even more about the subject.

What do you think? Do you have other techniques that you think could help in creating small and effective PRs? Or do you disagree that this is necessary? Let me know via Twitter at @jcranky.


Zalando Tech Blog – Technical Articles Conor Clifford

A Challenge

Shortly after joining Zalando, I, along with a small number of other new colleagues in a newly opened Dublin office, was entrusted with the task of building an important part of the new Fashion Platform - in particular, the core services around Zalando's Article data. This task came with several interesting challenges, not least of which was ensuring the new platform provided not just sufficient capacity and throughput for existing workloads, but also room for longer-term growth - in terms of data volumes and throughput as well as the number, and types, of users of that data. The aim here was the democratization of data for all potential users on the new platform.

Decision Time

It was decided that this new platform would be primarily an event driven system - with data changes being streamed to consumers. These consumers would subscribe, receive, and process the data appropriately for their own needs - essentially inverting the flow of data, from the traditional “pull” based architectures, to a “push” based approach. With this, we were looking to strongly prompt a wide adoption of a “third generation microservices” architecture.

In an event driven system, the outbound events themselves are at least as important as the data being managed by the system. The primary responsibility of the system is not just to manage the data, but also to ensure a fully correct, and efficient, outbound event stream, as it is this event stream that is the primary source of data for the majority of clients of this system.

Starting with an API First approach, the event structure and definition were treated as much a part of the system’s API as the more traditional HTTP API being designed. Beyond just the definition of the events (as part of the API), key focus was placed on ensuring both correctness of the events (compared to any stored data, in addition to the sequence of changes made to that data), as well as efficient publishing of the stream of events. This Event First approach meant that any decisions around design or implementation were taken always with correctness, and efficiency, of the outbound event stream in primary focus.

Initially, we built a quick prototype of the data services - primitive CRUD-type services, with synchronous HTTP APIs, each interacting directly with a simple (dedicated) PostgreSQL database as the operational store for the data. Outbound events were generated after completion of DB updates.

For this prototype, a very simple HTTP-based mockup of an event delivery system was used, while we decided on the real eventing infrastructure that would be used.

Not only did this prototype allow us to quickly exercise the APIs (in particular the event definitions) as they were being constructed, it also allowed us to quickly identify several shortfalls with this type of synchronous service model, including:

  • Dealing with multiple networked systems, especially around ensuring correct delivery of outbound events for every completed data change
  • Ensuring concurrent modifications to the same data entities are correctly sequenced, guaranteeing correct outbound event sequenced delivery
  • Effectively supporting a variety of data-providing client types, from live low-latency clients through to high-volume bulk clients.

Throw away your prototypes

With these limitations in mind, we worked at moving from this synchronous service approach to an asynchronous approach, processing data using an Event Sourcing model. At the same time, we progressed in our selection of an eventing platform, and were looking strongly at Apache Kafka - the combination of high throughput, guaranteed ordering, at least once delivery semantics, strong partitioning, natural backpressure handling, and log compaction capability were a winning combination for dealing with the outbound events.

Having selected Kafka as the outbound event platform, it was also a natural choice for the inbound data processing. Using Kafka as the inbound event source, the logic for processing the data became a relatively simple event processing engine. Much of the feature set that was valuable for outbound event processing was equally valuable for the inbound processing:

  • High throughput allowing for fast data ingestion - HTTP submissions get transformed into inbound events published to an internal topic - even with high acknowledgement settings for publishing these events, submission times are generally on the order of single-digit milliseconds per submitted event (a minimal producer sketch follows this list). By allowing clients to submit data with fast, guaranteed, accepted responses, clients can safely proceed through their workload promptly - allowing for a greater flow of information through the wider system in general.
  • Guaranteed ordering - moving processing to event processing on a guaranteed ordered topic removed a lot of complexity around concurrent modifications, as well as cross-entity validations, etc.
  • At least once delivery - With any network-oriented service, modelling data changes to be idempotent is an important best practice - it allows reprocessing the same request/event (in cases of retries, or in the case of at least once delivery, repeated delivery.) Having this semantic in place for both the inbound event source, as well as the outbound event topic, actually allowed the event processing logic to use coarse grained retries around various activities (e.g. database manipulations, accessing remote validation engines, audit trail generations, and of course, outbound event delivery.) Removing the need for complex transaction handling allowed for much simpler logic, and as such, higher throughput in the nominal case.
  • Natural Backpressure handling - with Kafka’s “pull” based semantics, clients process data at their own rate - there is no complex feedback/throttling interactions required for clients to implement.
  • Partitioning - using Kafka’s partitioning capabilities, the internal event source topics can be subdivided logically - some careful thought to select an appropriate partitioning key was required for some data services (especially those with interesting cross-entity validation requirements), but once partitioned, it allowed the processing logic of the application to be scaled effectively horizontally, as each partition can be processed without any involvement with any data in the other partitions.
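
To make the inbound path concrete, here is a minimal sketch of such a submission producer using the standard Kafka Java client - the topic name, key choice and bootstrap address are illustrative, not the actual platform's:

import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class InboundEventPublisher {

    private final KafkaProducer<String, String> producer;

    public InboundEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // "acks=all": the broker acknowledges only once all in-sync replicas have the event
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Publish one submitted article change; the key drives partitioning and thus per-entity ordering.
    public RecordMetadata publish(String articleId, String payload) throws Exception {
        Future<RecordMetadata> ack =
                producer.send(new ProducerRecord<>("article-submissions", articleId, payload));
        return ack.get(); // block until the broker acknowledges, then answer the HTTP client
    }
}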

There were also several additional benefits to the use of Kafka for the event sources, including:

  • As it was already a selected platform for the outbound events, there was no additional technology required for Event Source processing - the one tool was more than sufficient for both tasks - immediately reducing operational burden by avoiding different technologies for the two cases.
  • Using the same technology for Event Source processing as well as Outbound Event Delivery led to a highly composable architecture - one application’s Outbound event stream became another application’s inbound Event Source. In conjunction with judicious use of Kafka’s Log Compacted Topics, to act as a complete snapshot, bringing in new applications “later” was not a problem.
  • By building a suite of asynchronous services and applications all around an event sourcing and delivery data model, identifying bottlenecks in applications became much simpler - monitoring the lag in processing the event source for any given application makes bottlenecks much clearer, allowing us to quickly direct efforts to the hotspots without delay.
  • Coordinating event processing, retries, etc. - it was possible to minimise the interaction with underlying operational databases to just the data being processed - no large transactional handling, no additional advisory (or otherwise) locking, no secondary “messaging” queue tables, etc. This allowed much simpler optimisation of these datastores for the key operational nature of the services in question.
  • Processing applications could be, and several already have been, refactored transparently to process batches of events - allowing for the many efficiencies that come with batch processing (e.g. bulk operations within databases, reduced network costs, etc.) - this could be done naturally with Kafka, as the client model directly supports event batches (see the consumer sketch after this list). Adding batch processing in this way ensures all applications get the benefits of batch processing without impacting client APIs (forcing clients to create batches), and also without loss of low latency under “bursty” loads.
  • Separation of client data submissions from data processing allows for (temporary) disabling of the processing engines without interrupting client data requests - this allows for a far less intrusive operational model for these applications.
  • A coarse grained event sourcing model is much more amenable to a heterogeneous technology ecosystem - using “the right tool for the job” - for example, PostgreSQL for operational datastores, Solr/ElasticSearch for search/exploratory accesses, S3/DynamoDB for additional event archival/snapshotting, etc. - all primed from the single eventing platform.
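
As a rough sketch of the batch-oriented consumption mentioned above (topic, group id and the processBatch step are illustrative), the standard Kafka consumer already hands events over in batches:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchEventProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "article-projection");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after successful processing
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("article-events"));
            while (true) {
                // poll() naturally returns a batch; its size grows with the backlog (natural backpressure)
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                if (batch.isEmpty()) {
                    continue;
                }
                processBatch(batch);   // e.g. one bulk upsert into the operational datastore
                consumer.commitSync(); // at-least-once: reprocessing after a crash must stay idempotent
            }
        }
    }

    private static void processBatch(ConsumerRecords<String, String> batch) {
        for (ConsumerRecord<String, String> record : batch) {
            // idempotent handling of each event goes here
        }
    }
}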

Today, and Moving Forward

Today, we have a suite of close to a dozen loosely coupled event driven services and applications - all processing data asynchronously, composed via event streams. These applications and services, built on a standard set of patterns, are readily operated, enhanced, and further developed by anyone in our larger, and still growing, team. As new requirements and opportunities come up around these applications, and the data itself, we have strong confidence in our capability to grow this system as appropriate.

If you find the topics in this post interesting, and would enjoy these types of challenges, come join us - we're hiring!


Zalando Tech Blog – Technical Articles Team Alpha

Programming is hard, and being part of an engineering team is even harder. In most organizations, cross-functional teams are not evenly staffed with frontend and backend engineers; the mix depends on requirements. Teams are not stable either, and people don't all have the same amount of experience. People come and go, but software stays on, so we need to buckle up and maintain it.

Retrospective

One year ago we started a new project within Zalando Category Management, which is the branch of Zalando that looks after all of our fashionable apparel and accessories. We had to implement a new system to support the reorganization of Zalando Buyers into new, more autonomous teams, to enable them to work more effectively.

When we developed a Minimum Viable Product (MVP), neither of our backend developers could support or add new features to our frontend. Due to project workload, our backend developers couldn’t collaborate with our frontend developers, nor did they have any visibility of progress. Therefore, to address these concerns we decided to introduce full-stack responsibility – and we failed! We failed because of several factors:

  • The frontend stack was too sophisticated for the tasks we had to complete (Angular 2 Beta + Angular CLI + ngrx store);
  • User stories were not feature-focused, but instead role-focused (separate backend and frontend stories);
  • It was hard to dive into frontend development on a daily basis.

Once again, we face the issue that some frontend engineers switch teams or roles, but the original team is still responsible for all the products that have been developed. We have since decided to become responsible end-to-end as a team, independent of individual team members or engineering roles.

What has changed since?

We learned from our previous experience that we have to decide on the instrument we use as a team, as well as share knowledge early and often. This is why we took a two-week sprint to evaluate two popular frameworks (Angular and React) which allowed us to make an informed decision this time around.

We also challenged our Product Specialist to provide us with feature-oriented user stories so we can break them down into smaller subtasks containing frontend and backend parts. This allows us to have truly full-stack user stories, including both frontend and backend, which leads us to work together and share knowledge. All in all, this leads to a better product.

Finally, we introduced a “health check” in our sprint planning to track whether we still work as one team. Every two weeks during sprint planning we ask ourselves: “Are we still one team?” We check our backlog and ask if the whole team is satisfied with the scope for the next sprint. Then, based on our criteria, we define the status of the health check and see if any immediate action is needed or if we are progressing towards our goal. It reminds us of the issues we have as a team and keeps our commitment to solving them high.

It’s getting personal

When taking on the task of introducing end-to-end responsibility, we surveyed the whole team and looked for answers to a specific question:

What is the single most important thing YOU wish to take care of to make our full-stack initiative a success, and why is it so important?

Check out some of our answers below. Do you agree?

"That no one is afraid of changing code anywhere in our stack. Which also means we don't have single points of failure."

"Having good documentation about 'Where to start?' and 'What architecture, tools?' are we using. I think most of the time developers of one domain are just overwhelmed with where to start when you want to write code, do a bug fix or add a small feature. For example, if you want to contribute to a Play-Scala project as a frontend developer you don't know where to change things, what the structure of the project is, which things you have to keep in mind if you do an API change etc. It is the same when you ask a Java backend developer to add a new component to an AngularJS application. I think what could help the most is something that good open source projects are doing:

  • Provide a great README as an overview to the project
  • Provide Checklists and Guidelines for Contributors to describe shortly what a user would need to do if he/she wants to add a new component, a new API endpoint etc."

"Understanding that cross-functional teams are equally responsible members for each part of their system. While there might be only frontend expertise or backend expertise in the team, from the responsibility aspect it doesn't have any impact. Decisions, discussions and changes should be discussed independently from the roles of a frontend developer or backend developer. Increase the expertise of backend developers in the frontend and vice versa to make them more impactful in discussions. They would feel more responsible and feel a stronger ownership if they could bring up valuable arguments in the discussions. In collaboration with product, we should send at least one backend developer also to product-related discussions to avoid knowledge silos."

“To make sure that people with different backgrounds actually work together and practice pair programming. In my opinion, this is crucial to succeed and also to understand other ways of working.”

We’re just starting on this full stack journey. If you’re interested in how we progress, follow us to know more! The official Zalando Tech Twitter account is here.


Zalando Tech Blog – Technical Articles Nikolay Jetchev

This blog post gives an overview of the latest work in image generation using machine learning at Zalando Research. In particular, we show how we advanced the state-of-the art in the field by using deep neural networks to produce photo-realistic high resolution texture images in real-time.

In the spirit of Zalando’s embracing of open source, we've published two consecutive papers (see https://arxiv.org/abs/1611.08207 and https://arxiv.org/abs/1705.06566) at world-class machine learning conferences, and the source code (https://github.com/zalandoresearch/spatial_gan and https://github.com/zalandoresearch/psgan) to reproduce the research is also available on GitHub.

State-of-the-art in Machine Learning

It’s all over town. Machine learning, and in particular deep learning, is the new black. And justifiably so: not only do vast datasets and raw computational GPU power contribute to this, but the influx of brilliant people dedicating their time to the topic has also accelerated progress in the field.

Computer Vision and Machine Learning

Computer Vision methods are very popular in Zalando’s research lab, where we constantly work on improving our classification and recommendation methods. This type of business-relevant research aims to discriminate articles according to their visual properties, which is what most people expect from computer vision methods. However, the recent deep learning revolution has made a great step towards generative models - models that can actually create novel content and images.

Generative Adversarial Networks

The typical approach in machine learning is to formulate a so-called loss function, which basically quantifies a distance of the output of a model to samples from a dataset. The model parameters can then be optimized on this dataset by minimizing a loss function. For many datasets this is a viable approach - for image generation, however, it fails. The problem is that nobody knows how to plausibly measure the distance of a generated image to a real one - standard measures, which typically assume isotropic Gaussianity, do not correspond to our perception of image distances. However, how do humans know how to perceive this distance? Could it be that the answer is in the image data itself?

In 2014, Ian Goodfellow published a brilliant idea [Goodfellow et al. 2014] which strongly indicates that it seems to be in the data: he proposed to learn the loss function in addition to the model. But how do we do that?

The key inspiration comes from game theory. We have two different networks. First, a generative model (‘generator network’) takes noise as input and should transform it into valid images. Second, a discriminator network is added, which should learn the loss function. The two networks then enter a game in which they compete: the discriminator network tries to tell if an image is from the generator network or a real image, while the generator tries to be as good as possible in fooling the discriminator network into believing that it produces real images. Due to the competitive nature of this setup, Ian called this approach Generative Adversarial Networks (GANs).
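
For reference, the resulting two-player game from [Goodfellow et al. 2014] is usually written as a minimax objective over the generator $G$ and the discriminator $D$:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

$D$ is rewarded for telling real images $x$ apart from generated ones $G(z)$, while $G$ is rewarded for fooling $D$.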

Since 2014 a lot has happened; in particular, GANs have been built with convolutional architectures, called DCGANs (Deep Convolutional GANs) [Radford et al. 2015]. DCGANs are able to produce convincing and crisp images that resemble, but are not contained in, the training set - i.e. to some degree you could say the network is creative, because it invents new images. You could now argue that this is not too impressive, because it is ‘just’ in the style of the training set, and hence not really ‘creative’. However, let us convince you that it is at least technically spectacular.

Consider that DCGANs learn a probability distribution over images, which are of extremely high dimensionality. As an example, assume we want to fit a Gaussian distribution to match image statistics. The sufficient statistic (i.e. the parameters that fully specify it) of a Gaussian distribution is the mean and covariance matrix, which for a color image of (only) 64x64 pixels would mean that more than 75 million parameters have to be determined. To make this even worse, it has been known for decades by now that Gaussian statistics are not even sufficient for images - 75 million parameters are therefore only a lower bound. Hence, as typically less than 75 million images are used, it is from a statistical perspective borderline crazy that DCGANs actually work at all.
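
To spell out the arithmetic: a colour image of 64x64 pixels has $64 \cdot 64 \cdot 3 = 12{,}288$ dimensions, and a symmetric covariance matrix over that many dimensions alone has

$$\frac{12{,}288 \cdot (12{,}288 + 1)}{2} \approx 7.55 \times 10^{7}$$

entries, before even counting the 12,288 entries of the mean.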

Texture synthesis methods

Textures capture the look and feel of a surface, and they are very important in Computer Generated Imagery. The goal of texture synthesis is to learn a generating process and sample textures with the "right" properties, corresponding to an example texture image.

Classical methods include instance-based approaches [Efros et al. 2001] where parts of the example texture are copied and recombined. Other methods define parametric statistics [Portilla et al. 2000] that capture the properties of the “right” texture and create images by optimizing a loss function to minimize the distance between the example and the generated images.

However, both of these methods have a big drawback: they are slow to generate images, taking as much as 10 minutes for a single output texture of size 256x256. This is clearly too slow for many applications.

More recent work [Gatys et al. 2015] uses filters of pretrained deep neural networks to define powerful statistics of texture properties. It yields textures of very high quality, but it comes with the disadvantage of high computational cost to produce a novel texture, due to the optimization procedure involved (note that there has been work to short-cut the optimization procedure more recently).

Besides the computational speed issue, there are a few other issues plaguing texture synthesis methods. One of them is the failure to accurately reproduce textures with periodic patterns. Such textures are important both in nature - e.g. the scales of a fish - and for human fashion design, e.g. the regular patterns of a knitted material. Another issue needing improvement is the ability to handle multiple example textures and learn a texture process with properties reflecting the diverse inputs. The methods mentioned above cannot flexibly model diverse texture classes in the process they learn.

Spatial Generative Adversarial Networks (SGANs)

Our own research into generative models and textures allowed us to solve many of the challenges of existing texture synthesis methods and to establish a new state of the art for texture generation algorithms.

The key insight we had in Spatial Generative Adversarial Networks (SGANs) [Jetchev et al. 2016] is that in texture images, the appearance is the same everywhere. Hence, a texture generation network needs to reproduce the data statistics only locally and, when we ignore alignment issues (see PSGAN below for how to fix this), can generate far-away regions of an image independently of each other. But how can this idea be implemented in a GAN?

Recall that in GANs a randomly sampled input vector, e.g. from a uniform distribution, gets transformed into an image by the generative network. We extend this concept to sampling a whole tensor Z, or spatial field of vectors. This tensor Z is then transformed by a fully convolutional network to produce an image X. A fully convolutional network consists of exclusively convolutional layers, i.e. layers in which neuronal weights are shared over spatial positions. The images this network produces have therefore local statistics that are the same everywhere.
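
Schematically, the generator maps a spatial field of noise vectors to an image:

$$Z \in \mathbb{R}^{L \times M \times d} \;\longmapsto\; X = G(Z) \in \mathbb{R}^{H \times W \times 3}$$

Because $G$ is fully convolutional, enlarging the spatial extent $L \times M$ of $Z$ simply enlarges the output size $H \times W$ - no retraining is needed (the sizes here are schematic; see [Jetchev et al. 2016] for the precise architecture).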

In a standard, non-convolutional fully-connected network, adding new neurons at a layer implies adding weights that connect to those neurons. In a convolutional layer, this is not the case, as the weights are shared across positions. Hence, the spatial extent of a layer can be changed by simply changing the inputs to the layer. In a fully-convolutional network, a model can be trained on a certain image size, but then rolled out to a much larger size. In fact, given the texture assumption above, i.e. that the generation process of the image at different locations is independent (given a large enough distance), generation of images of arbitrarily large size is possible. The only (theoretical) constraint is compute time. The resulting images locally resemble the texture of the original image on which the model was trained, see Figure 1. This is a key point where the research goes beyond the literature, as standard GANs are bound to a fixed size, and producing globally coherent, uncannily large images remains a standing challenge.

Further, as the generator network is a fully-convolutional feed-forward network, and convolutions are efficiently implemented on GPUs and current deep learning libraries, image generation is very fast: generation of an image of size 512x512 takes 0.019 seconds on an nVidia K80 GPU. This corresponds to 50 frames per second! As this is real-time speed, we built a webcam demo, with which you can observe and manipulate texture generation – see here.

Periodic Spatial GANs (PSGANs)

The second paper we wrote on the texture topic was published recently at the International Conference on Machine Learning, where we also gave a talk about it. In the paper, we addressed two shortcomings of the original SGAN paper.

The first shortcoming of SGANs is that they always sample from the same statistical process, which means that after they’re trained, they always produce the same texture. When the network is trained on a single image, it produces texture images that correspond to the texture in that image. However, if it is trained on a set of images, it produces texture images that mix the original textures in the outputs, see Figure 2. Often, though, we’d rather have the network produce an output image that resembles one of the training images - and not all of them simultaneously. This means that the generation process needs some global information that encodes which texture to generate. We achieved this by setting a few dimensions of each vector in the spatial field of vectors Z to be identical across all positions - instead of randomly sampling them as in the previous section. These dimensions are hence globally identical, and we therefore call them global dimensions. Figure 3 shows an image that resulted from a model trained with this idea on many pictures of snake skins from the Describable Textures Dataset. Figure 4 shows how the flower example looks when explicitly learning the diverse flower image dataset - totally different behaviour from the example in Figure 2. In addition to training on a set of texture images, the model can also be trained on clip-outs of one larger image, which itself does not have to be an image of textures. Generating images with the model will then result in textures that resemble the local appearance of the original image.

An interesting property of GANs is that a small change in the input vector Z results in a small change of the output image. Moving between two points in Z-space hence morphs one image smoothly into another one [Radford et al. 2015]. In our model, we can take this property one step further, because we have a spatial field of inputs: we can interpolate the global dimension in Z in space. Figure 5 shows that this produces smooth transitions between the learned textures in one image.

Second, many textures contain long-range dependencies. In particular, in periodic textures the structure changes at well-defined length scales - the periods - thus the generation process is not independent of other positions. However, we can make it independent by handing information about the phase of our periodic structure to the local generation processes. We did this by adding simple sinusoids of a given periodicity, so-called plane-waves (see Figure 6), to our input Z. The wavenumbers that determine the periodicity of the sinusoids were learned as a function of the current texture (using multi-layer perceptrons), i.e. as a function of the global dimensions. This allows the network to learn different periodic structures for diverse textures. Figure 7 shows generated images learned on two different input images for various methods: text and a honeycomb texture. PSGAN is the only method which manages to generate images without artifacts. Note in particular that the other two neural based methods (SGAN and Gatys’ method) scramble the honeycomb pattern. Interestingly, our simulations indicate that the periodic dimensions also helped stabilize the generation process of non-periodic patterns. Our interpretation of this observation is that it helps to anchor generation processes to a coordinate system.

Discussion and Outlook

As a wrap-up, in this blog post we have given an overview of how we extended current methods in generative image modeling to allow for very fast creation of high-resolution texture images.

The method exceeds the state of the art in the following ways:

  • Scalable generation of arbitrarily large texture images
  • Learning texture image processes that represent various texture image classes
  • Flexible sampling of diverse textures and blending them into novel textures

Please check out our research papers for more details, and the accompanying videos showing how to animate textures.

So far, this is basic research with a strong academic focus. In a longer-term perspective, one of several potential products could be a virtual wardrobe, which could be used to assess how Zalando’s customers will look in a desired article, e.g. a dress. Will it fit? How will I look in it? Solutions to these questions will very likely become a reality in the future of e-commerce online shopping. We already have academic results that get closer to this use case, and a paper will be published soon in a “Computer Vision for Fashion” workshop at the International Conference on Computer Vision.

Stay tuned!

References

[Bergmann et al. 2017] Urs Bergmann, Nikolay Jetchev and Roland Vollgraf. Learning Texture Manifolds with the Periodic Spatial GAN. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017.

[Efros et al. 2001] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, 2001.

[Gatys et al. 2015] Leon Gatys, Alexander Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, 2015.

[Goodfellow et al. 2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014.

[Jetchev et al. 2016] Nikolay Jetchev, Urs Bergmann and Roland Vollgraf. Texture Synthesis with Spatial Generative Adversarial Networks. In Adversarial Learning Workshop at NIPS, 2016.

[Portilla et al. 2000] Javier Portilla and Eero P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int. J. Comput. Vision, 40(1), October 2000.

[Radford et al. 2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.


Zalando Tech Blog – Technical Articles Jan Schulz

At Zalando, we are constantly looking into ways to broaden our assortment, in both depth and width. This is to make sure that all fashion items are available anywhere and at any time for our customers. Our Partner Program helps bring this vision to life. Through the Partner Program, brands and retailers can integrate their own e-commerce stock into the Zalando Fashion Store and ship their products directly from their own warehouse to Zalando customers.

Following this, we not only want to offer the best and freshest assortment to customers, but a frictionless shopping experience throughout the whole process – including delivery and returns. We are constantly improving our service proposition and also want our partners to fulfill the high standards that our customers are used to – standards that some partners often struggle with due to limited logistics capabilities for certain markets.

With Zalando Fulfillment Solutions (ZFS), we’re now able to help our partners in the Partner Program with these challenges and offer up our logistics expertise, taking over all logistics processes from inbound to pick, pack, shipping, and returns. Better availability of products is extremely important - not only for Zalando to offer the best assortment, but also for our partners to further grow their business. With Zalando Fulfillment Solutions we are able to provide our current and future brand partners with highly customized and reliable solutions, enabling them to sell their merchandise through our platform without having to worry about logistics.

Zalando Fulfillment Solutions addresses different target groups - smaller brands and retailers as well as bigger partners - by using synergies and the one-parcel principle: more than half of the orders containing an item from our Partner Program also contain an article from Zalando Wholesale. With all items from the Partner Program and Wholesale in our Zalando warehouse, we can simplify the process for all parties involved: customers no longer receive two different parcels, but one combined package, with shipping costs shared with our partner. This is not only more efficient but also more profitable overall. However, bigger partners still sell their products via different channels and prefer full flexibility for their inventory. This introduces the idea of replenishment: Zalando wants to enable its partners to replenish the right amount of fashion items to reduce:

  • Lost sales, due to insufficient inventory
  • Inventory holding costs, due to too much inventory

To deliver on this we have developed the FAST Replenishment Algorithm, which serves ZFS partners with recommendations on what fashion items need to be replenished and in what quantity. In the following post, we address the challenges in the proposition, key product features, and possible improvements for future iterations.

Challenges and opportunities

In short, we face two main challenges in the project: The forecasting of demand and the delivery of operational excellence with our FAST supply.

Supply comes in two flavours: the ZFS partner’s replenishment and returns from customers. Neither is deterministic with regard to:

  • The quantities our partner actually replenishes: in some cases, partners have insufficient inventory units to follow the recommended quantity.
  • The lead time between when the partner receives the replenishment recommendation and when the replenished inventory units are available for sale.
  • The quantities and lead time of customer returns.

Demand forecasting can be seen as even more challenging, for reasons such as:

  • Fashion is seasonal, meaning a fashion article’s life cycle is short (< 180 days) and keeps getting shorter (fast fashion runs on a 28-day cycle).
  • Demand is steered with promotions (advertisements), while inventory management works on SKU level (i.e. article sample, size, or EAN).
  • Demand forecasting for fashion-type products is described as a problem of high uncertainty, high volatility and impulsive buying behavior. Several authors advise against forecasting demand for these products, and instead recommend building an agile supply chain that can satisfy demand as soon as it occurs.

Replenishment planning is always integer planning, and this presents another challenge: you cannot replenish a fractional number of fashion items to exactly cover the demand for your intended days of coverage. It is therefore crucial to verify, for each demand pattern, the impact of rounding up, rounding down, or proportionally rolling the dice.
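
As an illustration of the "proportionally rolling the dice" option (a sketch, not the production logic), a fractional recommendation can be rounded stochastically so that its expected value matches the fractional quantity:

import java.util.Random;

public class ReplenishmentRounding {

    private static final Random RANDOM = new Random();

    // Rounds a fractional replenishment quantity to an integer. Rounding up happens with
    // probability equal to the fractional part, so the expected value equals the input.
    static int roundProportionally(double recommendedUnits) {
        int base = (int) Math.floor(recommendedUnits);
        double fraction = recommendedUnits - base;
        return RANDOM.nextDouble() < fraction ? base + 1 : base;
    }

    public static void main(String[] args) {
        // e.g. a recommendation of 2.3 units becomes 3 in ~30% of cases and 2 otherwise
        System.out.println(roundProportionally(2.3));
    }
}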

Key solution concepts

FAST replenishment

A FAST supply chain gives us a powerful strategic advantage. FAST is a reference to the speed of replenishment, which can be broken down into the following steps:

  1. Zalando calculates a replenishment recommendation
  2. Our partner coordinates their inventory availability and replenishment shipping schedules
  3. Zalando receives the replenishment

A high replenishment process speed is equivalent to a shorter replenishment lead time, and therefore to the lower inventory levels needed to fulfill customer demand.
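
One standard textbook way to make that relationship explicit - this is classical inventory theory, not the ZFS algorithm itself - is the reorder-point model:

$$\text{reorder point} = \bar{d} \cdot L + z \cdot \sigma_d \cdot \sqrt{L}$$

where $\bar{d}$ is the average daily demand, $\sigma_d$ its standard deviation, $L$ the replenishment lead time in days, and $z$ the desired service-level factor. Both the cycle-stock term and the safety-stock term shrink as $L$ gets shorter.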

Currently, FAST is implemented as a weekly inventory review. Zalando, together with its ZFS partner BESTSELLER, is able to execute replenishment with a one-week cycle time. Other ZFS partners aim to improve their cycle times as well.

The key contributions here are clear wins for both sides: fewer out-of-stock situations mean higher sales, while the partnership yields lower inventory costs and thus higher margins.

How agile product development helped the process

Agile product development is a perfect fit for data-driven product development, especially when the product is a replenishment algorithm.

In order to start quickly and learn fast, our Logistics Algorithms team focused on continuous interaction between the customer, our ZFS partner, and Zalando, organised into weekly build, measure, and learn cycles.

The Logistics Algorithms team was able to successfully contribute real business value within one week by radically focusing on the problem and reducing the scope in order to build an MVP.

This was done with a script that created a CSV file with the ZFS partner’s SKUs and a “recommended” replenishment quantity. Starting with such a minimal “recommendation” raised the question of how to assess the quality of any ZFS partner replenishment algorithm, and therefore what to measure. The Logistics Algorithms team started with some standard inventory control KPIs as a basis for this.

In order to build quickly, the Logistics Algorithms team used Anaconda as their data science platform.

From the open data science pillar, Python and Jupyter Notebooks were used to collaborate and share results, including data science models and visualizations, as well as to reproduce results and govern the ZFS replenishment algorithm product as a whole.

On the data front, the team used standard ODBC connectivity to extract, transform and load sales data, as well as inventory and article data, from Zalando’s EXASOL. Postgres is our standard for data storage.

Demand forecast

Any type of replenishment is based on forecasting the demand for items. The quality of the demand forecast is defined as the forecast accuracy, which depends on the level of detail and the time horizon. Our FAST replenishment algorithm requires an SKU-level demand forecast for a time horizon of about one to two weeks. One great way to assess demand forecast quality is to benchmark your performance within the industry. The Institute of Business Forecasting and Planning provides such benchmarks for the short term; for a one-month outlook they report:

  • Aggregate forecasts with an average error rate between 10.4% and 15%
  • SKU-level forecasts with error rates ranging from a staggering 27% to 37.7%.

Forecast errors on high-volume SKUs cause greater issues for a business than errors on slower-moving SKUs. A stock-out caused by low forecast accuracy on a fast mover has a huge impact on sales volume and profitability. If low forecast accuracy instead causes overstocking, too much working capital is tied up in inventory and extra warehousing costs are incurred.

Forecasting methods

To forecast the demand, our Logistics Algorithms team applied standard quantitative methods such as a naive moving average with several lookback windows (7, 14, 28, and 42 days), as well as simple exponential smoothing based on historic sales data on SKU level. Demand forecasts for new articles perform best at a higher aggregation level, such as article configuration, brand, or category. The team also applied the principle of combining several reasonable forecasting methods, which yielded more accuracy overall.
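
For illustration, minimal versions of the two baselines mentioned above could look like this (a sketch over a daily sales array; the team's actual code is not shown here):

public class DemandForecast {

    // Naive moving average: mean of the last lookbackDays observations.
    static double movingAverage(double[] dailySales, int lookbackDays) {
        double sum = 0.0;
        int start = Math.max(0, dailySales.length - lookbackDays);
        for (int i = start; i < dailySales.length; i++) {
            sum += dailySales[i];
        }
        return sum / (dailySales.length - start);
    }

    // Simple exponential smoothing: level = alpha * x_t + (1 - alpha) * level.
    static double exponentialSmoothing(double[] dailySales, double alpha) {
        double level = dailySales[0];
        for (int i = 1; i < dailySales.length; i++) {
            level = alpha * dailySales[i] + (1 - alpha) * level;
        }
        return level; // one-step-ahead forecast for the next day
    }

    public static void main(String[] args) {
        double[] sales = {3, 5, 2, 4, 6, 3, 4, 5, 4, 3, 2, 6, 5, 4};
        System.out.printf("Moving average (7 days): %.2f%n", movingAverage(sales, 7));
        System.out.printf("Exponential smoothing (alpha=0.3): %.2f%n", exponentialSmoothing(sales, 0.3));
    }
}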

Key product features

For ZFS partners, features are configurable and include sales channels, replenishment cycle time, and inventory cycle time at the service level. We also automatically detect stock-outs for partners based on their current inventory on hand. Historic sales data is also taken into account.

Stock-up recommendations on SKU level are based on demand pattern segmentation, applying the forecast method with the best accuracy for each segment.

How do we further improve the service?

To speed up the supply chain even more, our ZFS FAST Replenishment Algorithm must incorporate check-point events along the supply chain. This could look like the following:

  1. When our ZFS partner acknowledges replenishments
  2. When Zalando accepts replenishments inbound
  3. When ZFS partners ship replenishments
  4. When Zalando receives replenishments
  5. When Zalando stores replenishments

Once the supply chain is controlled in this way, Zalando and its ZFS partners can move from a weekly periodic review to a continuous review, processing multiple replenishment cycles in parallel.

Outlook

The Zalando platform is an operating system for the fashion world, with multiple ways of integrating all sorts of fashion contributors and stakeholders. Our logistics services enable the platform, and ZFS is merely one example of how we cater to specific stakeholder needs. We see ZFS as supporting the growth of our Partner Program by meeting high delivery standards and supporting one of our core values: To make the fashion experience as frictionless as possible.

Currently, Zalando supports ZFS from only one dedicated warehouse. In the future, ZFS will be rolled out to multiple warehouses, which means the FAST Replenishment Algorithm must consider multi-warehouse allocation for ZFS inventory.

We expect an increase in the level of organisational and technological maturity in the next iteration of this service: from manual execution and supervision (build, measure, learn) to an even more automated approach. In the end, we aim to enable partners to further build up their business, becoming the go-to digital strategy for their growth. We see further partners and further countries being added to increase the scope and scale of our solution.


Zalando Tech Blog – Technical Articles Aliaksandr Kavalevich

In this article I would like to talk about the integration of Amazon DynamoDB into your development process. I will not try to convince you to use Amazon DynamoDB, as I will assume that you have already made the decision to use it and have several questions about how to start development.

Development is not only about production code - it should also include integration tests and support for different environments for running more complex tests. How do you achieve that with a SaaS database? For integration tests and local development in particular, Amazon provides a local installation of DynamoDB. You can use it for your tests and local development; it will save you a lot of money and also increase the execution speed of your integration tests. In this post, I'll show you how to write your production code and integration tests, and how to separate different environments with Java, Spring Boot, and Gradle.

Let's start with a simple example. First we will need to create a Gradle build file with all the needed dependencies included:

apply plugin: 'java'
apply plugin: 'spring-boot'

buildscript {
    repositories {
        mavenCentral()
        maven {
            url "https://plugins.gradle.org/m2/"
        }
    }
    dependencies {
        classpath "org.springframework.boot:spring-boot-gradle-plugin:1.3.2.RELEASE"
    }
}

repositories {
    mavenCentral()
}

jar {
    baseName = 'application-gradle'
    version = '0.1.0'
}

dependencies {
    compile('org.springframework.boot:spring-boot-starter-web:1.3.2.RELEASE')
    compile 'com.amazonaws:aws-java-sdk-dynamodb:1.10.52'
    compile 'com.github.derjust:spring-data-dynamodb:4.2.0'
    testCompile 'junit:junit:4.12'
    testCompile 'org.springframework.boot:spring-boot-starter-test'
}

bootRun {
    addResources = false
    main = 'org.article.Application'
}

test {
    testLogging {
        events "passed", "skipped", "failed"
    }
}

Two main dependencies for using DynamoDB are:

compile 'com.amazonaws:aws-java-sdk-dynamodb:1.10.52'
compile 'com.github.derjust:spring-data-dynamodb:4.2.0'

These dependencies add Amazon DynamoDB support: the first provides the standard AWS client for DynamoDB, and the second adds Spring Data support on top of it.

The next step is creating a Spring Boot configuration class to configure the connection to DynamoDB. It should look like this:

package org.article.config;

import org.apache.commons.lang3.StringUtils;
import org.socialsignin.spring.data.dynamodb.core.DynamoDBOperations;
import org.socialsignin.spring.data.dynamodb.core.DynamoDBTemplate;
import org.socialsignin.spring.data.dynamodb.repository.config.EnableDynamoDBRepositories;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapperConfig;

@EnableDynamoDBRepositories(basePackages = "org.article.repo", dynamoDBOperationsRef = "dynamoDBOperations")
@Configuration
public class DynamoDBConfig {

    @Value("${amazonDynamodbEndpoint}")
    private String amazonDynamoDBEndpoint;

    @Value("${environment}")
    private String environment;

    @Value("${region}")
    private String region;

    @Bean
    public AmazonDynamoDB amazonDynamoDB() {
        final AmazonDynamoDBClient client = new AmazonDynamoDBClient();
        client.setSignerRegionOverride(Regions.fromName(region).getName());
        if (StringUtils.isNotEmpty(amazonDynamoDBEndpoint)) {
            client.setEndpoint(amazonDynamoDBEndpoint);
        }
        return client;
    }

    @Bean
    public DynamoDBOperations dynamoDBOperations() {
        final DynamoDBTemplate dynamoDBTemplate = new DynamoDBTemplate(amazonDynamoDB());
        final DynamoDBMapperConfig.TableNameOverride tableNameOverride =
                DynamoDBMapperConfig.TableNameOverride.withTableNamePrefix(environment);
        dynamoDBTemplate.setDynamoDBMapperConfig(new DynamoDBMapperConfig(tableNameOverride));

        return dynamoDBTemplate;
    }
}

Here we've created an Amazon DynamoDB client for the specified region. It’s important to note that Amazon provides DynamoDB in different regions, and those DBs are completely separate instances, so specifying the region matters. By default, the client uses region "us-east-1". We've also added the possibility to override the DynamoDB endpoint. For production code you don’t need to specify this endpoint, since the client provided by Amazon derives the appropriate URL from the region itself. For test purposes you only need to specify the URL of your local DynamoDB installation.

Another decision to be made is about environment separation. Per AWS account, there is only one DynamoDB instance per region. There are two ways to have several environments (e.g. production and staging) in Amazon DynamoDB.

The first approach is to have two separate accounts - one per environment. The main benefit of this approach is that you have two completely separate environments. The main disadvantage however is that you have to maintain two accounts and switch between them during development. There can be quite a big overhead for this task.

The second approach is to separate environments using table name prefixes. For example, for the table "User" there will be no real table named "User" in DynamoDB; instead, there will be tables like "prodUser" and "stageUser". The main benefit is precisely that it avoids the main disadvantage of the previous approach: you don’t have to switch between accounts.

Now it's time to create a Java entity. It should look like this:

package org.article.domain;

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

@DynamoDBTable(tableName = "User")
public class User {

    @DynamoDBHashKey
    private String userName;

    @DynamoDBAttribute
    private String firstName;

    @DynamoDBAttribute
    private String lastName;

    public String getUserName() {
        return userName;
    }

    public void setUserName(final String userName) {
        this.userName = userName;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(final String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(final String lastName) {
        this.lastName = lastName;
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) {
            return true;
        }

        if (!(o instanceof User)) {
            return false;
        }

        final User other = (User) o;
        if (userName != null ? !userName.equals(other.userName) : other.userName != null) {
            return false;
        }

        if (firstName != null ? !firstName.equals(other.firstName) : other.firstName != null) {
            return false;
        }

        return lastName != null ? lastName.equals(other.lastName) : other.lastName == null;
    }

    @Override
    public int hashCode() {
        int result = userName != null ? userName.hashCode() : 0;
        result = 31 * result + (firstName != null ? firstName.hashCode() : 0);
        result = 31 * result + (lastName != null ? lastName.hashCode() : 0);
        return result;
    }
}

The User entity looks like a usual POJO. In addition to that, we have a couple of annotations. The DynamoDBTable annotation shows that this class corresponds to the table named “User”. In this table we have exactly one hash key; to specify it we use the DynamoDBHashKey annotation, so we marked the field userName as the hash key. We also have two attributes, firstName and lastName, annotated with DynamoDBAttribute. Please note that this entity has a partition key without a sort key.
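
If you do need a sort key as well, the AWS SDK mapper provides the DynamoDBRangeKey annotation. The Order entity below is a hypothetical sketch for illustration only; note that spring-data-dynamodb repositories for composite keys additionally require a dedicated id class, which is beyond the scope of this example:

package org.article.domain;

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBRangeKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

// Hypothetical entity with both a partition key (userName) and a sort key (orderId).
@DynamoDBTable(tableName = "Order")
public class Order {

    @DynamoDBHashKey
    private String userName;

    @DynamoDBRangeKey
    private String orderId;

    @DynamoDBAttribute
    private String status;

    // getters and setters omitted for brevity
}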

After the entity, we create the UserRepository, which is just an interface extending CrudRepository. We specify the entity and the type of its id – and that's it! Now we have basic CRUD operations implemented for us:

package org.article.repo;

import org.article.domain.User;
import org.socialsignin.spring.data.dynamodb.repository.EnableScan;
import org.springframework.data.repository.CrudRepository;

@EnableScan
public interface UserRepository extends CrudRepository<User, String> { }
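
As a side note, spring-data-dynamodb can also derive queries from method names, just like other Spring Data modules. The finder below is an illustrative assumption rather than part of the example project; with @EnableScan, a query on a non-key attribute such as lastName is executed as a table scan, which can be expensive on large tables:

package org.article.repo;

import java.util.List;

import org.article.domain.User;
import org.socialsignin.spring.data.dynamodb.repository.EnableScan;
import org.springframework.data.repository.CrudRepository;

@EnableScan
public interface UserRepository extends CrudRepository<User, String> {

    // Derived query method; executed as a DynamoDB scan because lastName is not a key attribute.
    List<User> findByLastName(String lastName);
}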

At the moment, we have both User entity and UserRepository with basic CRUD operations implemented, so it’s time to check them out with an integration test. First, we need to change our Gradle build to run an integration test. Local DynamoDB should start before tests and stop right after. We also need to create tables in DynamoDB. Although it doesn’t have a schema in the usual way, you still need to create tables and specify the partition key and sort key, if needed. To start local DynamoDB, create tables, and stop the local DynamoDB instance, there’s a nice Maven plugin here. The main disadvantage of this plugin is that it can create tables only for local DynamoDB instances, but not for the real Amazon environment. As you'll need to create tables for the production environment anyway, I believe this should be done exactly the same way as you’d do it for your local instance (that's why I don't use this plugin). What I like to do is start a local DynamoDB instance from a Docker container. If you don't have Docker yet, you can find instructions on how to set it up here.

The first Gradle task that we need is to start the local DynamoDB instance:

task startDB(type: Exec) {
    commandLine "bash", "-c", "docker run -p ${dbPort}:${dbPort} -d tray/dynamodb-local -inMemory -sharedDb -port ${dbPort}"
}

This will start DynamoDB on the port specified by the dbPort property. We used two parameters to start the DB: the first is “inMemory”, which tells DynamoDB to run completely in memory. The second is “sharedDb”, which ensures there is no region or credential separation in the DB.

The next step is to create the tables. We will keep a table description in JSON format in the database directory, so for now there is just one file, User.json:

{
    "AttributeDefinitions": [
        {
            "AttributeName": "userName",
            "AttributeType": "S"
        }
    ],
    "TableName": "User",
    "KeySchema": [
        {
            "AttributeName": "userName",
            "KeyType": "HASH"
        }
    ],
    "ProvisionedThroughput": {
        "ReadCapacityUnits": 10,
        "WriteCapacityUnits": 10
    }
}

We also need to add a Gradle task to create the User table in DynamoDB:

task deployDB(type: Exec) {
    mustRunAfter startDB

    def dynamoDBEndpoint
    if (amazonDynamodbEndpoint != "") {
        dynamoDBEndpoint = "--endpoint-url=${amazonDynamodbEndpoint}"
    } else {
        dynamoDBEndpoint = ""
    }

    commandLine "bash", "-c", "for f in \$(find database -name \"*.json\"); do aws --region ${region} dynamodb create-table ${dynamoDBEndpoint} --cli-input-json \"\$(cat \$f | sed -e 's/TableName\": \"/TableName\": \"${environment}/g')\"; done"
}

You’ll notice that with this task we can create tables both for a local and for the real AWS environment. We can also create tables for different environments in the cloud; all we need is to pass the right parameters. To deploy to the real AWS, execute the following command:

gradle deployDB -Penv=prod

Here the env parameter is the name of the property file to use. Next, we need a task to stop DynamoDB:

task stopDB(type: Exec) {
    commandLine "bash", "-c", "id=\$(docker ps | grep \"tray/dynamodb-local\" | awk '{print \$1}');if [[ \${id} ]]; then docker stop \$id; fi"
}

Let's configure those tasks to start DynamoDB before an integration test and to stop it right after:

test.dependsOn startDB
test.dependsOn deployDB
test.finalizedBy stopDB

Before we can execute our first integration test, we need to create two property files – the first one for production usage and the second one for the integration tests:

src/main/resources/prod.properties:

amazonDynamodbEndpoint=
environment=prod
region=eu-west-1
dbPort=
AWS_ACCESS_KEY=realValue
AWS_SECRET_ACCESS_KEY=realValue

src/test/resources/application.properties:

amazonDynamodbEndpoint=http://localhost:7777
environment=local
region=eu-west-1
dbPort=7777
AWS_ACCESS_KEY=nonEmpty
AWS_SECRET_ACCESS_KEY=nonEmpty

Now everything is ready to write our first integration test. It is very simple: we save an entity and then fetch it by ID:

package org.article.repo;

import static org.junit.Assert.assertEquals;

import org.article.Application;
import org.article.domain.User;
import org.junit.After;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.SpringApplicationConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@SpringApplicationConfiguration(classes = Application.class)
@RunWith(SpringJUnit4ClassRunner.class)
public class UserRepositoryIT {

    @Autowired
    private UserRepository userRepository;

    @After
    public void tearDown() {
        userRepository.deleteAll();
    }

    @Test
    public void findByUserName() {
        final User user = new User();
        user.setUserName("userName");
        user.setFirstName("firstName");
        user.setLastName("lastName");
        userRepository.save(user);

        final User actualUser = userRepository.findOne(user.getUserName());
        assertEquals(user, actualUser);
    }
}

Now we can execute this test with Gradle from the command line:

    gradle clean build

And there you go! In this post, I've shown you how to manage a local instance of Amazon DynamoDB during the development process using Spring Data and the Gradle build tool. I've also shown you how to create tables for a real AWS environment and how to separate environments using the standard Java client for AWS. The source code for this procedure can be found on our GitHub page.

Zalando Tech Blog – Technical Articles Nicolas Braun

When combing through state-of-the-art articles about IT-Compliance Management, I frequently stumble over the following quintessence: “IT-Compliance is a necessary (legal) evil and - because of that - boring by nature”. I can’t exclude myself from this perception: I used to think exactly the same way before taking over three teams in Zalando’s Platform Engineering department, one of them being the IT-Compliance team (with the very appropriate name “Torch”). After more than 1.5 years of dedicated work in this field, I can now say with a clear conscience: IT-Compliance is actually exciting!

Why do I feel the need to state such a fact? My foremost purpose is to promote working in the field of IT-Compliance more positively, more appealingly, and - overall - more prominently. The common understanding of IT-Compliance is not currently positive, which is something that I hope to change with the following blog post. At Zalando, our challenges regarding the technology landscape, regulations, and the people required to make the magic happen all elevate IT-Compliance to exciting heights.

Need an example? In order to ensure compliant software development, we implemented an open source agent that enforces guidelines for GitHub repositories. It automatically checks pull requests before they are merged. This kind of work is what makes IT-Compliance all the more impressive.

The Challenge

So what’s so challenging about dealing with IT-Compliance in the 21st century? To find that out we need to take a closer look at the fundamental protagonists of the “game”.

First, there is the IT landscape itself. Second, there are regulations. Third, we must always include the people involved. And last but not least, there is the company that wishes to remain competitive in the market. Finding the sweet spot of a well-balanced alignment between these factors is the key to success.

IT Landscape

Modern IT systems are as complex as they are diverse. This is related to the emergence and use of countless technologies and programming languages. Rapid, continuous enhancement of existing technologies makes the landscape profoundly volatile. Along with this comes the “modern engineering mindset”: agile, curious, experiment-happy, and willing to take risks. Both aspects cross-fertilize each other and strengthen the use of ever newer technologies. On top of that, Zalando grants engineers a high degree of development freedom. The logical consequence is a regularly changing way of developing software and bringing it to production, which fuels rapid change even further.

Regulations

Regulations can be vague, and technology is changing rapidly, as noted above. This means that quite often regulations can’t keep pace. From a technical perspective, part of the dilemma becomes the question: how do we adequately address vague regulations?

People

Rational people understand the need for being compliant. However, they face natural business-driven constraints such as time pressure and delivery stress. Under these circumstances, engineers tend to avoid undesired overhead. One of the most frequently asked questions is: “why do I have to do that?”. Understanding and clarifying the “why” (in both directions) is an indispensable prerequisite. Afterwards, addressing the constraints (e.g. offering frictionless compliance tooling) while deliberately sharpening an engineer's mindset and raising awareness is the most challenging mission.

Company

Zalando is a multi-billion dollar business with the fastest growing technology engineering group in Europe. In fact, it’s one of the fastest growing European companies, with a transition from startup to IPO in six years. How do we find a healthy balance between investment and return on investment? How do you even measure IT-Compliance costs? How can you guarantee IT-Compliance in a company of this size and scope?

Managing IT-Compliance in the 21st Century

IT-Compliance of the modern age has to cope with all the challenges listed above and more. It’s as simple as this: nobody knows how to achieve “100% IT-Compliance”. However, certainty needs to be brought into a sea of uncertainty. The assessment procedures of yearly IT audits are not fully transparent either. In order to survive here and address the aforementioned challenges, we identified two building blocks: “Strategic Focus” and “Division of Powers”.

Strategic Focus

Even amid turmoil, we keep the unit focused on objectives and strategy. All teams are involved in setting and evolving the vision, goals, and progress of our work. Focus topics are identified as change management, data classification, and access management. Having defined the “what to do?” we then define the “how to get there?” via a maturity model and by mapping each focus topic to it. The model consists of several maturity levels that can be thought of as well-defined evolutionary plateaus towards achieving service excellence. In the end, the “when to reach maturity?” is stated by putting a concrete timeline on top of each focus topic in accordance with its current maturity level.

Division of Powers

Theory (legislative power) and practice (executive power) are merged into an indivisible unit, which serves the inside and the outside - neither arrogant nor dictating, and with a clear guideline of consolidated, unified communication. Important in the overall concept is that the executive power - although acting as an internal supervisory committee - neither appears nor is perceived as the judiciary. The latter enters the “game” early enough in the form of audit companies or the internal revision department.

Instead, we strive for closely involving employees in all matters of compliance and taking their concerns seriously. Feedback is our most valuable asset, highly appreciated and always taken into consideration. Another important piece of the puzzle is the support of both legislative and executive powers via close collaboration with a dedicated engineering team.

Legislative Power: ITC Foundation

This unit deals with Scoping and Narrowing of IT-Compliance requirements. Risk-based rules are identified along the focus topics and communicated to the relevant engineering units. A close collaboration with our stakeholders is essential. Main credo: not against them - with them! This credo is also reflected in the provisioning of exciting, innovative IT-Compliance trainings and bootcamps, around topics such as resolving violations, or understanding our Rules of Play in quiz-like or gamified formats. Moreover, individual consultancy services and support channels complete the task area.   

Executive Authority: IT Internal Controls

This unit implements Measuring and Monitoring solutions. Main credo: uncover violations before the auditors find them! For this purpose, control measures are defined and executed along the focus topics. Reporting results to stakeholders in a reasonable way is a critical endeavor. Where necessary, this also entails professional execution of escalation management (a shared activity with ITC Foundation).

Computerized Support: ITC Engineering

A third technical unit fulfills Remediating and Automating tasks. Dedicated tooling supports legislative and executive powers as well as customers in their daily work. The primary goal here is to achieve the highest possible automation of manual processes. Monitoring activities are supported by implementing a reliable visualization of violations (e.g. in the form of IT-Compliance dashboards). Tooling is evaluated for compliant usage and - if applicable - integrated into a “Compliance Radar” (analogous to Zalando’s Tech Radar). In addition, the unit takes over the important task of supporting all stakeholders in understanding the complex IT landscape and the offered tooling itself.

Conclusion

After reading this blog entry, the conclusion for you should be pretty simple: it’s real fun contributing to the various aspects of IT-Compliance in our modern age. Finding smart solutions to meet regulations and standards in a large, versatile Tech environment - like you find at Zalando - is actually one of the biggest challenges in Europe.

Where to learn, grow and succeed better than here?

Zalando Tech Blog – Technical Articles Christoph Luetke Schelhowe

As Zalando continues taking steps towards becoming a fully-fledged platform, we want to move fast, validate the ways that our big strategic moves pay off, and capture the full value of our products by continuous optimization. To this end, we wanted to ensure that we’re bringing data-informed decision making to the forefront of our processes by establishing a true data and experimentation culture that could ultimately become a competitive advantage in today’s fast-changing world.

Zalando has always been a data-driven company and analytics has been one of our key success factors. We believe that much of the success (or failure) of a product rides on data, and on how it is used. This brought about the following question: How can we elevate Zalando to the next level of data-informed decision making? This is how the Product Analytics department came to life.

Purpose

The purpose of the Product Analytics team is to embed a true data and experimentation culture at Zalando to empower smart decision making.

What do we mean by true data and experimentation culture?

  • Our Business Units are aligned around key metrics that are rooted in our most important business priorities. Success is defined by a set of well-proven metrics which individual teams own and contribute to.
  • Every team can access the data they need from various data sources and with high data quality. Setting up tracking is easy as well as assessing the data quality. Understanding user behavior based on A/B tests is quick and teams are always running multiple experiments at the same time.
  • Every team can draw the right insights from their data. Teams have the ability and skills to learn from and make decisions informed by data. Advanced analytics helps them discover problems and opportunities, plus focus on the right developments.
  • Decision making is not influenced by compromises, personal biases or egos, but only insights.

How can we get there?

To make data-informed decision making an easy and effective routine, and establish a data and experimentation culture, we focus on 1.) building a self-service infrastructure for experimentation, tracking, analytics 2.) ensuring common data governance, and 3.) enabling and educating all teams throughout Zalando.

  • Self-service infrastructure for tracking, experimentation, and analytics: Data analysis and experimentation should be fast and easy. Only true self-service tools are truly scalable given the size of our organization today.
  • Common data governance: With nearly 200 teams producing and consuming data events, there’s a growing need to ensure event tracking completeness and correctness and to allow for the easy compatibility of data.
  • Enablement and education: As we want to move fast, all teams must be enabled and empowered in data informed product development; e.g. from building a rationale around new features up to iterative testing and optimization at the end of the product lifecycle. We expect a certain data and experimentation affinity from everybody and want to embed a data-driven culture everywhere. In order to get there, we want to guide teams and help them be more rigorous by embedding an expert analyst role into teams.

Department structure and competencies

The Product Analytics department was created as a hybrid organization of central teams and team-embedded product analysts. The central teams provide world-class tools and knowledge in the domains of Economics, Tracking, Experimentation, Journey Modelling, and Digital & Process Analytics. Product Analysts are also embedded into teams across our Fashion Store, Data, and Logistics areas to focus on insight-driven product development. They play an instrumental part in all steps of the product lifecycle (“discover - define - design - deliver”) and can support insights-based decision making by performing the following tasks:

  • Understand user and customer behavior: Develop in-depth analytical understanding for what drives growth for the product and how it can be improved, thus inspiring product work.
  • Measure and monitor product progress: Analysts help to define target KPIs for the team and ensure that Product Specialists and Product Owners develop ownership of them. At the same time, they facilitate access to the key target KPIs and other relevant data. They establish methods to monitor short-term progress and long-term product health. When KPIs change, embedded analysts explore the underlying reasons and are able to provide context for these changes.
  • Prove if product ideas work: In the context of value creation, especially for new features, embedded analysts play an essential role by gathering and formulating analytical evidence that supports all phases in the product lifecycle, from discovery to rollout. Data must justify why we do what we do.
  • Drive product optimization: From a value capturing point of view, embedded analysts drive optimization iterations for existing features until they reach a local maximum.
  • Ensure data quality: Product Analysts create awareness about data quality within the teams where they are embedded. They have the responsibility of defining the specifications of the data to be generated by their teams, monitoring its quality and making sure the team addresses any quality-related issues they are responsible for.
  • Improve data literacy: Analysts drive the data mindset in their teams, educate and guide in terms of analytical methodology – they are enablers for any data leading to product decisions.

What the future holds

Ultimately, we want the magic of data-informed product development to happen in every team, guided by team-embedded Product Analysts and empowered by central teams with best in class self-service tools and methodologies. By adopting processes that ensure data-informed decision making is taking place, our teams can build better products and iterate faster than ever.

Opinions are great to start a discussion, but we win on insights from user behavior. We prove strong hypotheses with relentless and granular attention to data and KPIs driving our decisions. We believe in high frequency experimentation and iterations to create the best possible experience for customers and all other players in the ecosystem.

It’s our vision that every product decision – be it the discovery or rollout of a new product; be it on the customer-facing, brand, core platform or intermediary side – is backed by analytical insights and rigorous impact testing. Thereby, we’re building a solid foundation for the next big learning curve in analytics: Artificial Intelligence and Machine Learning. We’ll be revealing more about our plans and learnings in upcoming articles.

Interested in Product Analytics possibilities at Zalando? We’re hiring.

Zalando Tech Blog – Technical Articles Jan Brennenstuhl

JSON Web Tokens, or just JWTs (pron. [ˈdʒɒts]), are the new fancy kids around the block when it comes to transporting proofs of identity within an untrusted environment like the Web. In this article, I will describe the true purpose of JWTs. I will compare classical, stateful authentication with modern, stateless authentication. And I will explain why it is important to understand the fundamental difference of both approaches.

While there are many good articles available that describe specific aspects, best practices, or single use-cases of JWTs, the bigger picture is often missing. The actual problem that the JWT specs try to solve is just not part of most discussions. With JWTs gaining in popularity, however, that missing knowledge of the fundamental ideas behind JSON Web Tokens leads to serious questions like:

  • How do I invalidate or revoke a JWT?
  • How do I prolong the expiration date of an issued JWT?
  • Should I use JWTs at all, or keep my opaque tokens?

This article is not about symptoms, but the purpose of JWT which actually is: Getting rid of stateful authentication!

Stateful Authentication

In the old days of the Web, authentication was a pure stateful affair. With a centralized overlord entity being responsible for tokens, the world was fairly simple:

  • Tokens are issued and stored in a single service for future checking and revocation,
  • Clients and resource servers know a single point of truth for token verification and information gathering.

This worked rather well in a world of integrated systems (some might call them legacy app, mothership or simply Jimmy), when servers rendered frontends and dependencies existed on e.g. package-level and not between independently deployed applications.

In a world where applications are composed by a flock of autonomous microservices however, this stateful authentication approach comes with a couple of serious drawbacks:

  • Basically no service can operate without having a synchronous dependency towards the central token store,
  • The token overlord becomes an infrastructural bottleneck and single point of failure.

Ultimately, both facts oppose the fundamental ideas of microservice architectures. Stateful authentication not only introduces another dependency for all your single-purpose services (network latency!) but also makes them rely heavily on it. If the token overlord is not available (even for just a couple of seconds), everything is doomed. This is why a different approach is required: stateless authentication!

Stateless Authentication

Stateless authentication describes a system/process that enables its components to decentrally verify and introspect tokens. This ability to delegate token verification allows us to (partly) get rid of the direct coupling to a central token overlord and in that way enables state transfer for authentication. Having worked in stateless authentication environments for several years, the benefits in my eyes are clearly:

  • Less latency through local, decentralized token verification,
  • Custom authorization fallbacks due to local token interpretation,
  • Increased resilience due to the removal of network overhead.

Also, stateless authentication removes the need to keep track of issued tokens, and for that reason removes state (and hence reduces storage requirements) from your system.

The antiquated, heavyweight token overlord shrinks to yet another microservice mainly responsible for issuing tokens. All of this comes in handy, especially when your world mainly consists of single-page applications or mobile clients and services that primarily communicate via RESTful APIs.

“Using a JWT as a bearer for authorization, you can statelessly verify if the user is authenticated by simply checking if the expiration in the payload hasn’t expired and if the signature is valid.” —Jonatan Nilsson

One popular way to achieve stateless authentication is defined in RFC 7523 and leverages the OAuth 2.0 Authorization Framework (RFC 6749) by combining it with server-signed JSON Web Tokens (RFC 7519, RFC 7515). Instead of storing the token-to-principal relationship in a stateful manner, signed JWTs allow decentralized clients to securely store and validate access tokens without calling a central system for every request.

With tokens not being opaque but locally introspectable, clients can also retrieve additional information (if present) about the corresponding identity directly from the token, without needing to call another remote API.
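
To make this concrete, here is a minimal sketch of decentralized token verification using the open-source jjwt library (io.jsonwebtoken); the class name, key handling, and error handling are illustrative assumptions, not a description of any particular production setup:

import java.security.PublicKey;

import io.jsonwebtoken.Claims;
import io.jsonwebtoken.Jws;
import io.jsonwebtoken.JwtException;
import io.jsonwebtoken.Jwts;

public class StatelessTokenVerifier {

    // Public key of the token issuer, fetched once (e.g. at startup) and cached locally.
    private final PublicKey issuerPublicKey;

    public StatelessTokenVerifier(final PublicKey issuerPublicKey) {
        this.issuerPublicKey = issuerPublicKey;
    }

    // Verifies signature and expiration locally, without calling a central token store.
    // Returns the subject (principal) of a valid token, or null if the token is invalid or expired.
    public String verify(final String token) {
        try {
            final Jws<Claims> jws = Jwts.parser()
                    .setSigningKey(issuerPublicKey)
                    .parseClaimsJws(token); // throws if the signature is invalid or the token has expired
            return jws.getBody().getSubject();
        } catch (final JwtException e) {
            return null;
        }
    }
}

The only remote dependency left is the occasional retrieval of the issuer's public key, which can be cached aggressively.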

Stateful vs. Stateless

Nowadays, in a Web mainly characterized by a widespread transition from monolithic legacy apps to decoupled microservices, a centralized token overlord service is an additional burden. The purpose of JWT is to obviate the need for such a centralistic approach.

However, there again is no silver bullet and JWTs aren’t Swiss Army knives. Stateful authentication has its rightful place. If you really need a central authentication system (e.g. to fulfil restrictive auditing requirements), or if you simply don’t trust people or libraries to correctly verify your JWTs, a stateful overlord approach is still the way to go, and there is nothing wrong with that.

In my opinion, you probably shouldn’t mix both approaches. To shortly answer the questions above:

  • There is no way of invalidating/revoking a JWT (and I don’t see the point), except if you just use it as yet another random string within a stateful authenticating system.
  • There is no way of altering an issued JWT, so prolonging its expiration date is likewise not possible.
  • You could use JWTs if they really help you in solving your issues. You don’t have to use them. You can also keep your opaque tokens.

If you have further comments regarding the purpose of JWT, or if you think I missed something important, don't hesitate to drop me a message via Twitter. I also appreciate feedback and further discussion. Thanks!

Zalando Tech Blog – Technical Articles Hunter Kelly

To be able to measure the quality of some of the machine learning models that we have at Zalando, “Golden Standard” corpora are required.  However, creating a “Golden Standard” corpus is often laborious, tedious and time-consuming.  Thus, a method is needed to produce high quality validation corpora but without the traditional time and cost inefficiencies.

Motivation

As the Zalando Dublin Fashion Content Platform (FCP) continues to grow, we now have many different types of machine learning models.  As such, we need high quality labelled data sets that we can use to benchmark model performance and evaluate changes to the model.  Not only do we need such data sets for final validation, but going forward, we also need methods to acquire high-quality labelled data sets for training models.  This is becoming particularly clear as we start working on models for languages other than English.

Creating a “Golden Standard” corpus generally requires a human being to look at something and make some decisions.  This can be quite time consuming, and ultimately quite costly, as it is often the researcher(s) conducting the experiment that end up doing the labelling.  However, the labelling tasks themselves don't always require much prior knowledge, and could be done by anyone reasonably computer literate.  In this era of crowdsourcing platforms such as Amazon's Mechanical Turk and CrowdFlower, it makes sense to leverage these platforms to try to create these high quality data sets at a reasonable cost.

Background

Back when we first created our English language Fashion Classifier, we bootstrapped our labelled data by using the (now defunct) DMOZ, also known as the Open Directory Project.  This was a site where volunteers, since 1998, had been hand-categorizing websites and webpages.  A web page could live under one or more "Categories".  Using a snapshot of the site, we took any web pages/sites that had a category containing the word "fashion" anywhere in its name.  This became our “fashion” dataset.  We then also took a number of webpages and sites from categories like "News", "Sports", etc., to create our “non-fashion” dataset.

Taking these two sets of links, and with the assumption that they would be noisy, but "good enough", we generated our data sets and went about building our classifier.  And from all appearances, the data was "good enough".   We were able to build a classifier that performed well on the validation and test sets, as well as on some small, hand-crafted sanity test sets.  But now, as we circle around, creating classifiers in multiple languages and for different purposes, we want to know:

  • What is our data processing quality, assessed against real data?
  • When we train a new model, is this new model better?  In what ways is it better?
  • How accurate were our assumptions regarding "noisy but good enough"?
  • Do we need to revisit our data acquisition strategy, to reduce the noise?

And of course, the perennial question for any machine learning practitioner:

  • How can I get more data??!?

Approach

Given that Zalando already had a trial account with CrowdFlower, it was the natural choice of crowdsourcing platform to go with.  With some help from our colleagues, we were able to get set up and understand the basics of how to use the platform.

Side Note: Crowdsourcing is an adversarial system

Rather than bog down the main explanation of the approach with too many side notes, it is worth mentioning up-front that crowdsourcing should be viewed as an adversarial system.

CrowdFlower "jobs" work on the idea of "questions", and the reviewer is presented with a number of questions per page.  On each page there will be one "test question", which you must supply.  As such, the test questions are viewed as ground truth and are used to ensure that the reviewers are maintaining a high enough accuracy (configurable) on their answers.

Always remember, though, that a reviewer wants to answer as many questions as quickly as possible to maximize their earnings.  They will likely only skim the instructions, if they look at them at all.  It is important to consider accuracy thresholds and to design your jobs such that they cannot be easily gamed.  One step that we took, for example, was to put all the links through a URL shortener (see here), so that the reviewer could not simply look at the url and make a guess; they actually had to open up the page to make a decision.

Initial Experiments

We created a very simple job that contained 10 panels with a link and a dropdown, as shown below.

We had a data set of hand-picked links to use as our ground-truth test questions: approximately 90 fashion links and 45 non-fashion links.  We then also picked some of the links we had from our DMOZ data set and used those to run some experiments on.  Since this was solely about learning how to use the platform, we didn't agonize over this data set; we just picked 100 nominally fashion links and 100 nominally non-fashion links, and uploaded those as the data to use for the questions.

We ran two initial experiments.  In the first, we tried to use some of the more exotic, interesting "Quality Control" settings that CrowdFlower makes available, but we found that the number of "Untrusted Judgements" was far too high compared to "Trusted Judgements".  We simply stopped the job, copied it and launched another.

The second of the initial experiments proved quite promising: we got 200 links classified, with 3 judgements per link (so 600 trusted judgements in total).  The classifications from the reviewers matched the DMOZ labels pretty closely.  All the links where the DMOZ label and the CrowdFlower reviewers disagreed were examined; there was one borderline case that was understandable, and the rest were actually indicative of the noise we expected to see in the DMOZ labels.

Key learnings from initial experiments:
  • Interestingly, we really overpaid on the first job.  Dial down the costs until after you've run a few experiments.  If the “Contributor Satisfaction” panel on the main monitoring page has a “good” (green) rating, you’re probably paying too much.
  • Start simple.  While it is tempting to play with the advanced features right from the get-go, don't.  They can cause problems with your job running smoothly; only add them in if/when they are needed.
  • You can upload your ground truth questions directly rather than using the UI, see these CrowdFlower docs for more information.
  • You can have extra fields in the data you upload that aren't viewed by the reviewer at all; we were then able to use the CrowdFlower UI to quickly create pivot tables and compare the DMOZ labels against the generated labels.
  • You can get pretty reasonable results even with minimal instructions.
  • Design your job such that "bad apples" can't game the system.
  • It's fast!  You can get quite a few results in just an hour or two.
  • It's cheap!  You can run some initial experiments and get a feeling for what the quality is like for very little.  Even with our "massive" overspend on the first job, we still spent less than $10 total on our experiments.

Data Collection

Given the promising results from the initial experiments, we decided to proceed and collect a "Golden Standard" corpus of links, with approximately 5000 examples from each class (fashion and non-fashion).  Here is a brief overview of the data collection process:

  • Combine our original DMOZ link seed set with our current seed set
  • Use this new seed set to search the most recent CommonCrawl index to generate candidate links
  • Filter out any links that had been used in the training or evaluation of our existing classifiers
  • Sample approximately 10k links from each class: we intentionally sampled more than the target number to account for inevitable loss
  • Run the sampled links through a URL shortener to anonymize the urls
  • Prepare the data for upload to CrowdFlower

Final Runs

With data in hand, we wanted to make some final tweaks to the job before running it.  We fleshed out the instructions (not shown) with examples and more thorough definitions, even though we realized they would not be read by many.  We upped the minimum accuracy from 70% to 85% (as suggested by CrowdFlower).  Finally, we adjusted the text in the actual panels to explain what to do in borderline or error cases.

We ran a final experiment against the same 200 links as in the previous experiments.  The results were very similar, if not marginally better than the previous experiment, so we felt confident that the changes hadn't made anything worse.  We then incorporated the classified links as new ground truth test questions (where appropriate) into the final job.

We launched the job, asking for 15k links from a pool of roughly 20k.  Why 15k?  We wanted 5k links from each class; we were estimating about 20% noise on the DMOZ labels.  We also wanted a high level of agreement, so links that had 3/3 reviewers agreeing.  From the previous experiments, we were getting unanimous agreement on about 80% of the links seen.  So 10k + noise + agreement + fudge factor + human predilection for nice round numbers = 15k.
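
One way to read that arithmetic (an approximation, not an exact derivation): to keep 10,000 links after discarding the roughly 20% without unanimous agreement, you need about 10,000 / 0.8 = 12,500 links, and adding roughly 20% headroom for noisy DMOZ labels brings you to about 15,000.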

We launched the job in the afternoon; it completed overnight and the results were ready for analysis the next morning, which leads to...

Evaluation

How does the DMOZ data compare to the CrowdFlower data?  How good was "good enough"?

We can see two things, right away:

1. The things in DMOZ that we assumed were mostly not fashion, were, in fact, mostly not fashion.  1.5% noise is pretty acceptable.

2. Roughly 22% of all our DMOZ "Fashion" links are not fashion.  This is pretty noisy, and indicates that it was worth all the effort of building this properly labelled "Golden Standard" corpora in the first place!  There is definitely room for improvement in our data acquisition strategy.

Now, those percentages change if we only take into account the links where all the reviewers were in agreement; the noise in the fashion set drops down to 15%.  That's still pretty noisy.

So what did we end up with, for use in the final classifier evaluations?  Note that the total numbers don't add up to 15k because we simply skipped links that produced errors on fetching, 404s, etc.

This shows us that, similar to the initial experiments, we had unanimous agreement roughly 80% of the time.

Aside: It's interesting to note that both the DMOZ noise and the number of links where opinions were split work out to about 20%.  Does this point to some deeper truth about human contentiousness?  Who knows!

So what should we use to do our final evaluation?  It's tempting to use the clean set of data, where everyone is in agreement.  But on the other hand, we don't want to unintentionally add bias to our classifiers by only evaluating them on clean data.  So why not both?  Below are the results of running our old baseline classifier, as well as our new slimmer classifier, against both the "Unanimous" and "All" data sets.

Taking a look at our seeds and comparing that to the returned links, we find that 4,023 of the 15,000 are links in the seed set, with the following breakdown when we compare against nominal DMOZ labels:

Key Takeaways

  • Overall, the assumption that the DMOZ was "good enough" for our initial data acquisition was pretty valid.  It allowed us to move our project forward without a lot of time agonizing over labelled data.
  • The DMOZ data was quite noisy, however, and could lead to misunderstandings about the actual quality of our models if used as a "Golden Standard".
  • Crowdsourcing, and CrowdFlower, in particular, can be a viable way to accrue labelled data quickly and for a reasonable price.
  • We now have a "Golden Standard" corpus for our English Fashion Classifier against which we can measure changes.
  • We now have a methodology for creating not only "Golden Standard" corpora for measuring our current data processing quality, but a method that can be extended to create larger data sets that can be used for training and validation.
  • There may be room to improve the quality of our classifier by using a different type of classifier, that is more robust in the face of noise in the training data (since we've established that our original training data was quite noisy).
  • There may be room to improve the quality of the classifier by creating a less noisy training and validation set.

Conclusion

Machine Learning can be a great toolkit to use to solve tricky problems, but the quality of data is paramount, not just for training but also for evaluation.  Not only here in Dublin, but all across Zalando, we’re beginning to reap the benefits of affordable, high quality datasets that can be used for training and evaluation.  We’ve just scratched the surface, and we’re looking forward to seeing what’s next in the pipeline.

If you're interested in the intersection of microservices, stream data processing and machine learning, we're hiring.  Questions or comments?  You can find me on Twitter at @retnuH.
