Similar products in nopCommerce with machine learning techniques

Manually managing mappings of similar products in nopCommerce using the related products feature can be a hugely time-consuming administrative task if your store has a large product catalog. With modern machine-learning (ML) techniques that work can be automated, but training a model requires infrastructure and data that are not always available. The inner workings of the resulting model are also likely to be opaque, and its output difficult to predict beforehand or to tune once deployed. For those reasons, full ML approaches are more appropriate for full-fledged recommendation systems than for simple product-similarity tasks. Fortunately, we can borrow ideas and techniques from the ML field and apply them to finding similar products quickly and efficiently on the fly!

The problem

What makes two products "similar" is, of course, subjective, and in any case depends on what the customer is looking for. Is a blue T-shirt more similar to a blue polo shirt than to a green T-shirt? It is impossible to answer that question without more information about the customer, so we leave that to a full ML solution and limit our scope to finding the most similar products, given that we have some reasonable initial guess about customer preferences.

In nopCommerce, a product's properties are described by its specification attributes, so that is a good place to start when designing a product similarity measure. Intuitively, the more specification attributes two products have in common, the more similar they are. Returning to the T-shirt example, we might have three specification attributes: sleeve length, neck and colour. By just counting the shared specification attributes, both the blue polo shirt (sleeve length: short, neck: polo, colour: blue) and the green T-shirt (sleeve length: short, neck: round, colour: green) have two specification attributes in common with the blue T-shirt (sleeve length: short, neck: round, colour: blue) and are therefore equally similar to it. A blue V-necked sweater (sleeve length: long, neck: v-neck, colour: blue), on the other hand, would as expected be less similar, having only the colour in common with the blue T-shirt.
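The counting in the example above can be sketched in a few lines of Python. This is only an illustration: the dictionaries and the function name shared_attributes are made up for this post and are not part of nopCommerce or of any plugin.

```python
# Hypothetical product data mirroring the T-shirt example in the text.
# Each product maps a specification attribute to its selected option.
products = {
    "blue T-shirt": {"sleeve length": "short", "neck": "round", "colour": "blue"},
    "blue polo shirt": {"sleeve length": "short", "neck": "polo", "colour": "blue"},
    "green T-shirt": {"sleeve length": "short", "neck": "round", "colour": "green"},
    "blue V-necked sweater": {"sleeve length": "long", "neck": "v-neck", "colour": "blue"},
}


def shared_attributes(a, b):
    """Count the specification attributes for which both products have the same option."""
    return sum(1 for name, option in a.items() if b.get(name) == option)


reference = products["blue T-shirt"]
scores = {
    name: shared_attributes(reference, attributes)
    for name, attributes in products.items()
    if name != "blue T-shirt"
}

for name, score in scores.items():
    print(f"{name}: {score} shared attribute(s)")
```

The polo shirt and the green T-shirt both score 2, while the sweater scores 1, which matches the reasoning in the paragraph above.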

Image: similar products example. The picture shows similar products (bottom) on a product page for a wallpaper at FamilyWallpapers, generated by Majako's similar products plugin with default settings.

Sketching a solution

We can visualise this measure as the Manhattan distance between two products by viewing each product as a point in a space in which each binary dimension corresponds to one specification attribute option (such as blue, v-neck or short in our running example). Using the specification attributes themselves as the dimensions, with the options as points along each axis, might seem more natural, but remember that the options are in general categorical and usually not comparable with each other. Because we are only interested in the relative distance between pairs of products, we can replace the Manhattan distance with some other metric fulfilling the triangle inequality.
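To make the geometric picture concrete, here is a minimal sketch of such a binary embedding, assuming each product has exactly one option per specification attribute. The helper names build_vocabulary, embed and manhattan are hypothetical and chosen for this illustration only.

```python
# Hypothetical products from the running example.
tshirt = {"sleeve length": "short", "neck": "round", "colour": "blue"}
polo = {"sleeve length": "short", "neck": "polo", "colour": "blue"}
sweater = {"sleeve length": "long", "neck": "v-neck", "colour": "blue"}


def build_vocabulary(products):
    """Return every (attribute, option) pair seen; each pair becomes one binary dimension."""
    return sorted({pair for product in products for pair in product.items()})


def embed(product, vocabulary):
    """Map a product to a 0/1 vector with a 1 for each option the product has."""
    pairs = set(product.items())
    return [1 if pair in pairs else 0 for pair in vocabulary]


def manhattan(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


vocabulary = build_vocabulary([tshirt, polo, sweater])
d_polo = manhattan(embed(tshirt, vocabulary), embed(polo, vocabulary))
d_sweater = manhattan(embed(tshirt, vocabulary), embed(sweater, vocabulary))
print(d_polo, d_sweater)
```

Under the one-option-per-attribute assumption, every attribute on which two products disagree contributes 2 to the distance (one dimension is switched off and another switched on), so the ordering agrees with simply counting shared attributes.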

Of course, not all specification attributes are equal. In the earlier example, a reasonable objection might have been that surely sleeve length: short is less similar to sleeve length: long than colour: blue is to colour: green, because the customer is more likely to pick a green T-shirt than a blue sweater over a blue T-shirt. This issue is addressed by scaling the dimensions of the "product space" by some (inverse) weight, so that a specification attribute with a higher weight (such as sleeve length) has a higher impact on similarity than one with a lower weight (like colour). Although suitable weights could be learnt using ML, in most cases the store administrator will have the domain expertise to assign good-enough weights, at least with some experimentation. Irrelevant specification attributes can be turned off by simply setting their weights to zero.
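One way to picture the weighting is sketched below. It is an assumption made for illustration, not the plugin's actual scheme: each attribute on which two products disagree adds that attribute's weight to the distance, so a mismatch on a heavily weighted attribute pushes products further apart. The weight values and the function name weighted_distance are invented for the example.

```python
# Hypothetical products from the running example.
blue_tshirt = {"sleeve length": "short", "neck": "round", "colour": "blue"}
green_tshirt = {"sleeve length": "short", "neck": "round", "colour": "green"}
blue_sweater = {"sleeve length": "long", "neck": "v-neck", "colour": "blue"}

# Hypothetical weights an administrator might assign: sleeve length matters most.
weights = {"sleeve length": 2.0, "neck": 1.0, "colour": 0.5}


def weighted_distance(a, b, weights):
    """Add up the weights of all attributes on which the two products disagree."""
    total = 0.0
    for name in set(a) | set(b):
        if a.get(name) != b.get(name):
            # Attributes without an explicit weight default to 1.0 (an assumption).
            total += weights.get(name, 1.0)
    return total


d_green = weighted_distance(blue_tshirt, green_tshirt, weights)
d_sweater = weighted_distance(blue_tshirt, blue_sweater, weights)

# Setting a weight to zero switches the attribute off entirely.
colour_off = dict(weights, colour=0.0)
d_green_colour_off = weighted_distance(blue_tshirt, green_tshirt, colour_off)
print(d_green, d_sweater, d_green_colour_off)
```

With these made-up weights the green T-shirt ends up much closer to the blue T-shirt than the blue sweater does, and with colour switched off the two T-shirts become indistinguishable.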

The algorithm

Formulated as one of distance between points in some high-dimensional space, the similarity problem lends itself naturally to nearest-neighbour (NN) search methods. In fact, the problem of finding the k most similar products now becomes a simple kNN search, for which there are many existing exact and approximate algorithms. The product embedding can be precomputed, so that any call to the similarity function need only perform the actual search, which in Majako's implementation runs in the tens of milliseconds on a large production store when loading a product page. We exploit some unique properties of the specific representation of specification attributes to greatly improve running times and memory usage over off-the-shelf algorithms, so that the algorithm scales very well with the number of products and specification attribute options. Another benefit of this approach is that we are able to compute similarity for any pair of products without the need for massive database tables to store the results.
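For illustration only, the simplest exact kNN search is a brute-force scan over the catalog, sketched below. This is not Majako's optimised algorithm, whose details are not described here; the catalog, the unweighted distance and the function name k_most_similar are all assumptions made for the example.

```python
import heapq

# Hypothetical catalog keyed by product name.
catalog = {
    "blue T-shirt": {"sleeve length": "short", "neck": "round", "colour": "blue"},
    "blue polo shirt": {"sleeve length": "short", "neck": "polo", "colour": "blue"},
    "green T-shirt": {"sleeve length": "short", "neck": "round", "colour": "green"},
    "blue V-necked sweater": {"sleeve length": "long", "neck": "v-neck", "colour": "blue"},
}


def distance(a, b):
    """Number of specification attributes on which the two products disagree."""
    return sum(1 for name in set(a) | set(b) if a.get(name) != b.get(name))


def k_most_similar(query_id, catalog, k):
    """Brute-force kNN: score every other product and keep the k closest."""
    query = catalog[query_id]
    candidates = (
        (distance(query, attributes), product_id)
        for product_id, attributes in catalog.items()
        if product_id != query_id
    )
    # nsmallest orders by distance first; ties are broken alphabetically by name.
    return [product_id for _, product_id in heapq.nsmallest(k, candidates)]


similar = k_most_similar("blue T-shirt", catalog, k=2)
print(similar)
```

A scan like this touches every product on each call, which is fine for a toy catalog; the point of a specialised index is to avoid exactly that cost on a large store.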

As described earlier, specification attribute options are in general categorical variables that cannot be directly compared. However, in much the same way as product similarity can be computed by first embedding the products into a vector space, embeddings can also be learnt for the options within a specification attribute. This is a proper ML technique for learning feature representations, and is most commonly used in natural-language processing, where for example the Word2vec algorithm is used to learn word embeddings.


With Majako's upcoming plugin, similar products can be found and ranked according to their similarity with any given product efficiently and on the fly, with very little to no work required to get started. Full machine-learning systems, which need to be trained on large amounts of existing data, might be considered overkill for a simple product-similarity ranking task. Since the essential information about each product is usually already available in the form of specification attributes, we instead directly use them to represent products in a high-dimensional vector space, where we can compute a similarity metric as the simple distance between products. This approach is easy to understand and visualise, and produces predictable results that can be easily tuned by the administrator.

For more information, or if you need help with developing other similar solutions for nopCommerce, please feel free to contact us!

Rickard von Haugwitz Lead ML engineer, Majako AB
