1 Introduction

Recommendation systems (RSs) are a key component of the toolbox for analysing data from social media. A recent survey (Balaji et al. 2021) reviews the main methods and application areas. The focus of this paper is on Collaborative Filtering (CF) based RSs, in which the basic data structure is a matrix of ratings, with as many rows as users and as many columns as items, where each entry is the rating provided by a user (row) to an item (column). Rating matrices are mostly sparse because many entries are unknown: the key assumption is that the unknown ratings are predictable because the known ratings are often highly correlated across users or items.

The main driver in the development of RSs has been the accuracy of recommendations, i.e., the error in the prediction of unknown ratings. Recently, the need to recognize the increasing heterogeneity of users' demands has led to multiple metrics, such as diversity and novelty, which might conflict with each other. Generally speaking, increasing diversity and novelty might decrease accuracy (Castells et al. 2015). Therefore, we have to manage a trade-off between different objective functions, so that we are faced with multi-objective optimization problems (MOP). The best trade-off between the objectives can be defined in terms of Pareto optimality, in which the solutions are the elements of the Pareto set.

We consider Multi-objective evolutionary algorithms (MOEAs): recently some papers, e.g. Li et al. (2016) and Lin et al. (2018), have also focused on novelty and diversity.

In this paper we focus on the top-k recommendation problem, in which a list containing his or her top-k rated items is proposed to each user. A solution is encoded as a matrix whose rows correspond to users and whose columns correspond to items. The objectives of the optimization are accuracy, novelty, and coverage.

One key novelty of this paper is that a solution is associated not to a scalar value of each objective function, given by the average over users/items, but to the probability distribution of its sample values.

This maps the problem into a space whose elements are discrete probability distributions, which we represent as histograms: this space is structured by a distance between histograms, namely the Wasserstein (WST) distance. Measuring the distance between distributions can be accomplished by many alternative models. A general class of distances, known as f-divergences, is based on the expected value of a convex function of the ratio of the two distributions. If \(P\) and \(Q\) are two probability distributions over \({\mathbb{R}}^{d}\), absolutely continuous with respect to each other, and \(f\) is a convex function such that \(f\left(1\right)=0\), the f-divergence is given by:

$$ \begin{array}{*{20}c} {D_{f} \left( {P,Q} \right) = {\mathbb{E}}_{Q} f\left( \frac{P}{Q} \right)} \\ \end{array} $$
(1)

According to the choice of \(f\) the above formula yields specific distances, including the Kullback–Leibler (KL) distance, its symmetrized version Jensen–Shannon, the Hellinger distance, the total variation divergence and the \(\chi \)-square divergence. The main disadvantage of the KL and \(\chi \)-square distances with respect to Wasserstein is that they do not use information across different bins of the histograms and cannot handle distributions with different binning schemes, that is, with different supports. WST, first introduced in Monge (1781), received its modern formulation in Kantorovich (1942). WST has a very rich mathematical structure, whose complexity and flexibility are analyzed in a landmark volume (Villani 2009) and, in the discrete domain, in the tutorial (Solomon et al. 2014). A difficulty with WST is its computational cost, which has hampered its diffusion outside the computer vision community. Recently a number of specialized computational approaches have drastically reduced the computational hurdles (Peyré and Cuturi 2019).
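A small numerical illustration of this limitation, a sketch not taken from the paper and assuming equally spaced bins and the SciPy implementations of the KL divergence and of the one-dimensional Wasserstein distance, is the following: KL barely distinguishes a one-bin shift of the mass from a seven-bin shift, while WST grows with the distance the mass has to travel.

```python
# Sketch: KL vs Wasserstein on three histograms over the same 10 bins (illustrative values).
import numpy as np
from scipy.stats import entropy, wasserstein_distance

bins = np.arange(10.0)                    # bin centres (support)
p = np.zeros(10); p[1] = 1.0              # all mass in bin 1
q = np.zeros(10); q[2] = 1.0              # all mass in the adjacent bin 2
r = np.zeros(10); r[8] = 1.0              # all mass far away, in bin 8

eps = 1e-12                               # smoothing to avoid log(0) in KL
print(entropy(p + eps, q + eps), entropy(p + eps, r + eps))   # nearly identical values
print(wasserstein_distance(bins, bins, p, q),                 # small: mass moved by 1 bin
      wasserstein_distance(bins, bins, p, r))                 # large: mass moved by 7 bins
```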

1.1 Related works

A related approach has been proposed in Zheng et al. (2017) for the “grey sheep” problem. Users are represented as histograms, whose distance is given by their intersection and whose features are obtained from the quantile analysis of each histogram. Ribeiro et al. (2014) analyse multi-objective Pareto efficient approaches considering the accuracy of a top-k recommendation list along with novelty and coverage as objective functions. A similar approach is proposed in Li et al. (2016), where one more objective, serendipity, is considered.

A different approach, based on Gaussian Processes, has been proposed in Nguyen et al. (2014) and Galuzzi et al. (2020). Another approach, also based on Gaussian processes, is proposed in Vanchinathan et al. (2014), which deals explicitly with the explore/exploit dilemma in top-k list selection. The multi-armed bandit (MAB) is a classical formalism for studying the exploration/exploitation dilemma when the reward is an unknown pay-off function. Guillou et al. (2015) gathers feedback from users which is used to update the model of users’ preferences. Another approach which stresses the interaction with users is Christakopoulou and Banerjee (2018). In Hejazinia et al. (2019) it is shown how different approaches, including matrix factorization, can be embedded in the MAB framework. The MAB framework can also be extended to the case of context-dependent bandits (Gentile et al. 2017) and correlated arms (Wang et al. 2018a, b). A general consideration is that, while MAB models are very good at capturing the interaction with users and the online learning mechanism, they have difficulty dealing with top-k list recommendation and with multi-objective problems.

Multi-objective problems are typically dealt with using evolutionary algorithms; Ribeiro et al. (2014) exploit Pareto efficiency and show that the suggested lists are simultaneously accurate, diverse, and novel. The same objectives are considered in Lin et al. (2018), which shows that an extreme-point-based method can encode the prior knowledge of RSs and enhance the performance of personalized recommendation.

Two recent deterministic approaches are Gillis et al. (2021), where the issue of distributionally robust multi-objective optimization is considered, and Lin et al. (2019), which considers two objectives, CTR (Click-Through Rate) and GMV (Gross Merchandise Volume), and proposes a scalarization approach whose weights are automatically learnt and shown to guarantee Pareto efficiency.

A stream of research increasingly intertwined with RSs is metric learning. After the early contributions of Weinberger and Saul (2009), metric learning has been suggested for RSs, moving from the observation that the output of matrix factorization (MF) methods violates the triangle inequality between inter-user distances.

The literature on the WST distance is extremely large. Here we quote only a few very recent papers focused on the design of recommender systems. Meng et al. (2020) shows that embedding the RS into a metric space endowed with the WST distance enables an effective solution to the item cold-start recommendation. Zhao et al. (2021) propose a Wasserstein-based correlation analysis for cross-domain recommendation. The use of autoencoders and generative adversarial networks (GAN) for collaborative filtering has been recently proposed in Zhang et al. (2021) and Li et al. (2020).

The Wasserstein distance has also been proposed in the context of metric learning (Ma et al. 2020; Rakotomamonjy et al. 2018) to measure uncertainty, to embed user/item representations in a low-dimensional space and to comply with the triangle inequality.

Our contributions

  • The proposal of a probabilistic space in which both the data model and the optimization algorithm are embedded. The elements of this space are discrete probability distributions, specifically histograms.

  • A distributional representation of the rating matrix. To each user one can associate a vector whose components are the distances, according to a similarity measure, to all other users. This vector is synthesized as a histogram with \(N\) bins, where \(N\) is much smaller than the number of users. The value associated to each bin is the number of users whose distance from the target user falls in that bin. The histogram can be regarded as the signature of a user and the weights as the user's features.

  • The distance between two users in the probabilistic space is given by the Wasserstein distance between their histograms. Another important result is that the WST-based distributional representation enables the construction of a new graph, called the WST graph, whose nodes are the users and whose edge weights are the WST distances between users. The clustering of users then takes place on the WST graph.

  • The value of each objective (accuracy, novelty, and coverage), for each candidate top-k list, is evaluated as a sample and can be represented as a histogram. This multidimensional histogram encodes the knowledge obtained from function evaluations. The difference between two lists can be encoded as the WST distance between their histograms. This enables the definition of a WST-based selection operator in MOEAs.

  • The resulting algorithm Multi-objective Evolutionary Optimization/Wasserstein (MOEA/WST) is run on each cluster and compared with the algorithmic benchmark NSGA-II. On a standard benchmark data set for RSs MOEA/WST results in better hypervolume and coverage, in particular at low generation counts.

1.2 Organization of the paper

Section 2 contains the formal definition of the problem over the rating matrix and the basic notions of multi-objective optimization. Section 3 contains the basic definitions of the Wasserstein distance and its main computational methods. Section 4 introduces the distributional representation of users, the graph representation of the rating matrix, and the results of different clustering methods. Section 5 is devoted to the distributional representation of the objective functions and to the encoding of solutions. Section 6 contains the description of MOEA/WST and in particular of the new genetic operators. Section 7 reports the computational results, in terms of hypervolume and coverage of the Pareto approximation, comparing MOEA/WST and NSGA-II. Section 8 discusses the limitations of the approach and the related computational issues. Section 9 contains the conclusions and perspectives.

2 The problem definition and background information

2.1 The problem definition

The basic notions are:

  • The set of users \(U={\left\{{u}_{i}\right\}}_{i=1,\dots ,M}\), where \(M\) is the number of users.

  • The set of items \(O={\left\{{o}_{j}\right\}}_{j=1,\dots ,N}\), where \(N\) is the number of items.

Each user expresses his or her judgement, or rating, \(r\in X\), where typical rating values can be binary or integers from a given range. The set of all the ratings given by the users on the items can be represented as a partially specified matrix \(R\in {\mathbb{R}}^{M\times N}\), whose entries \({r}_{ij}\) express the rating of user \({u}_{i}\) for item \({o}_{j}\). Usually, each user rates only a small number of items, thus the matrix R is sparse.

Two types of Collaborative Filtering (CF) methods are commonly used to predict the unknown ratings: memory-based methods and model-based methods. The memory-based methods, or neighbourhood-based algorithms, are based on the fact that similar users display similar patterns of rating behaviour (user-based) or that similar items receive similar ratings (item-based). These methods usually rely on k-nearest neighbours (k-NN) algorithms, are simple to implement, and yield recommendations that are easy to explain. Clearly, the choice of the distance and the value of k are hyperparameters whose values substantially impact the performance of the method.

The model-based methods, since the Netflix contest, mostly use matrix factorization, where users and items are represented by latent factors in a low-dimensional space. Many variants of MF have been recently proposed and MF has become a de-facto cornerstone in the construction of RSs. The focus of this paper is not on matrix filling but on the multi-objective optimization problem in top-k recommendation.

Important evolutions of MF are given in Indyk et al. (2019), which proposes a learning-based approach, and in Wang et al. (2018a, b) and Zhang et al. (2018).

We used the basic implementation of k-NN in Surprise with the cosine similarity.
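The following sketch shows this memory-based baseline; the dataset name, the value of \(k\) and the train/test split are illustrative assumptions rather than the exact settings used in the experiments.

```python
# Sketch: user-based k-NN with cosine similarity in Surprise (assumed settings).
from surprise import Dataset, KNNBasic, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')                    # MovieLens 100k (downloaded on first use)
trainset, testset = train_test_split(data, test_size=0.2)

sim_options = {'name': 'cosine', 'user_based': True}      # user-user cosine similarity
algo = KNNBasic(k=40, sim_options=sim_options)            # k is a hyperparameter
algo.fit(trainset)
accuracy.rmse(algo.test(testset))                         # prediction error on the held-out ratings
```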

2.2 Multi-objective optimization

A multi-objective optimization problem (MOP) can be stated as follows:

$$ \begin{array}{*{20}c} {\mathop {\min }\limits_{{{\text{x}} \in {\Omega } \subseteq {\mathbb{R}}^{{\text{d}}} }} F\left( x \right) = \left( {f_{1} \left( x \right), \ldots ,f_{m} \left( x \right)} \right)} \\ \end{array} $$
(2)

Pareto rationality is the theoretical framework to analyse multi-objective optimization problems, where \(m\) objective functions \({f}_{1}\left(x\right),\dots ,{f}_{m}\left(x\right)\), with \({f}_{i}:\Omega \to {\mathbb{R}}\), are to be simultaneously optimized over the search space \(\Omega \subseteq {\mathbb{R}}^{d}\). Here we use \(x\) to be compliant with the usual notation of Pareto analysis; in this study \(x\) is a candidate top-k recommendation list.

Let \(u,v\in {\mathbb{R}}^{m}\) be the vectors of the objective values at two points. For a maximization problem, \(u\) is said to dominate \(v\) if and only if \({u}_{i}\ge {v}_{i}\) for all \(i=1,\dots ,m\) and \({u}_{j}>{v}_{j}\) for at least one index \(j\) (the inequalities are reversed for minimization). The goal in multi-objective optimization is to identify the Pareto frontier of \(F(x)\). A point \({x}^{*}\) is Pareto optimal for the problem in Eq. (2) if there is no point \(x\) such that \(F(x)\) dominates \({F(x}^{*})\). This implies that any improvement of a Pareto optimal point in one objective leads to a deterioration in another. The set of all Pareto optimal points is the Pareto set, and the set of all Pareto optimal objective vectors is the Pareto front (PF). The interest in finding points \(x\) whose associated \(F(x)\) lies on the Pareto frontier is clear: all of them represent efficient trade-offs between conflicting objectives and are the only ones, according to Pareto rationality, to be considered by the decision maker. The quality evaluation of Pareto solution sets is a key issue in multi-objective optimization.

To measure the progress of the optimization, a natural and widely used metric is the hypervolume indicator, which measures the volume of the objective space enclosed between a non-dominated set and a predefined reference point. An example of Pareto frontier, along with the reference point used to compute the hypervolume, is reported in Fig. 1.

Fig. 1
figure 1

An example of Pareto frontier (left), with the associated hypervolume (right), for two minimization objectives

A good approximation of the Pareto set will result into a high hypervolume value; thus, hypervolume is a reasonable measure for evaluating the quality of the optimization process.

The grey shaded area is the original hypervolume: a new point A improves the approximation of the exact Pareto front and increases the hypervolume by the blue shaded area. The improvement of the hypervolume can also be used in the selection of the new point, as in Daulton et al. (2020). This approach has been further developed using the gradient of the expected hypervolume improvement to speed up the selection process.

Another metric to compare different approximations of the Pareto front is the \(C\)-metric, also called coverage. Let \(A\) and \(B\) be two approximations of the PF, \(C(A,B)\) gives the fraction of solutions in \(B\) that are dominated by at least one solution in \(A\). Hence, \(C\left(A,B\right)=1\) means that all solutions in \(B\) are dominated by at least one solution in \(A\) while \(C\left(A,B\right)=0\) implies that no solution in \(B\) is dominated by a solution in \(A\).
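The two indicators can be computed as in the following sketch; the pymoo import reflects recent versions of the library (older releases expose the indicator through a factory function), and the two Pareto approximations are synthetic placeholders.

```python
# Sketch: hypervolume (via pymoo) and the coverage C-metric for two Pareto approximations.
import numpy as np
from pymoo.indicators.hv import HV   # pymoo >= 0.6

def dominates(u, v):
    """u dominates v (minimization): no worse in every objective, strictly better in one."""
    return np.all(u <= v) and np.any(u < v)

def c_metric(A, B):
    """C(A, B): fraction of points of B dominated by at least one point of A."""
    return np.mean([any(dominates(a, b) for a in A) for b in B])

A = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0]])   # placeholder approximation
B = np.array([[2.5, 3.0], [3.0, 3.5]])               # placeholder approximation

hv = HV(ref_point=np.array([5.0, 5.0]))              # reference point dominated by all points
print(hv(A), hv(B))
print(c_metric(A, B), c_metric(B, A))                # here 1.0 and 0.0
```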

2.3 Structure of the overall approach

Phase A: data representation and analysis


Step 1: Distributional representation of each user as a histogram whose values depend on the distribution of (user, user) similarity values. This step can be considered as the embedding of each user in a lower dimensional space whose elements are the histograms.


Step 2: The distance between histograms is computed as the Wasserstein distance (a.k.a. optimal transport distance) and the space of histograms, endowed with the Wasserstein metric, is called the Wasserstein space.


Step 3: The rating matrix can be represented as two alternative graphs:

  i. The usual cosine graph, where each edge is weighted by the cosine similarity of the two end-point users of the edge.

  ii. The WST graph, where each edge is weighted by the WST distance between the two histograms associated with the two end-point users of the edge.

Each graph can be clustered by any standard method like k-means or spectral clustering. In the computational results, spectral clustering has been used.
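A minimal sketch of this clustering step, assuming that a symmetric user-user affinity matrix has already been computed (cosine similarities, or e.g. \(\exp(-d_{ij}/\bar{d})\) for the WST distances), is the following.

```python
# Sketch: spectral clustering of a precomputed user-user affinity matrix (placeholder data).
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
W = rng.random((50, 50))
W = (W + W.T) / 2                      # make the placeholder affinity symmetric
np.fill_diagonal(W, 1.0)

clustering = SpectralClustering(n_clusters=3, affinity='precomputed', random_state=0)
labels = clustering.fit_predict(W)     # cluster label of each user
```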

Phase B: optimization


Step 4: Formulation of the multi-objective optimization problem.

The candidate top-k list is encoded as a matrix. The values of the objective functions (accuracy, novelty and coverage) are represented as multivariate histograms.


Step 5: Multi-objective evolutionary algorithm.

The selection operation is built upon the WST distance between histograms associated to two candidate solutions.

To measure the performance of the algorithm over each cluster computed in Step 3, hypervolume and coverage of the approximate Pareto sets have been used.

3 The Wasserstein distance

Measuring the distance between distributions can be accomplished by many alternative models. The most commonly used distances are the Kullback–Leibler (KL) distance, the Hellinger distance, the total variation divergence, the \(\chi \)-square divergence and the Jensen–Shannon distance, which is a symmetrized version of KL. The main disadvantage of KL and \(\chi \)-square for measuring the distance between histograms is that they account only for the correspondence between bins of the same index and do not use information across bins, nor can they handle distributions with different binning schemes, that is, with different supports.

This is clearly shown in Fig. 2 (Öcal et al. 2019), which compares the values of five distances (Wasserstein, Total Variation (TV), Hellinger, Jensen–Shannon and Kullback–Leibler (KL)) between three histograms. It is apparent that WST not only shows a better discrimination but also matches an intuitive and natural perception of the distances.

Fig. 2
figure 2

Three histograms and their distances (Öcal et al. 2019)

WST is a cross-bin distance and is not affected by different binning schemes. Moreover, WST naturally matches the perceptual notion of nearness and similarity, and the Wasserstein embedding can be easily generalized to multi-objective problems by considering \(m\)-dimensional histograms.

3.1 Basic definitions

The WST distance between continuous probability distributions is:

$$ \begin{array}{*{20}c} {W_{p} \left( {P^{\left( 1 \right)} ,P^{\left( 2 \right)} } \right) = \left( {\mathop {\inf }\limits_{{{\upgamma } \in {\Gamma }\left( {P^{\left( 1 \right)} ,P^{\left( 2 \right)} } \right)}} \mathop \smallint \limits_{X \times X}^{{}} d\left( {x^{\left( 1 \right)} ,x^{\left( 2 \right)} } \right)^{p} d\gamma \left( {x^{\left( 1 \right)} ,x^{\left( 2 \right)} } \right)} \right)^{\frac{1}{p}} } \\ \end{array} $$
(3)

where \(d({x}^{\left(1\right)},{x}^{\left(2\right)})\) is also called ground distance (usually it is the Euclidean norm), \(\Gamma \left({P}^{\left(1\right)},{P}^{\left(2\right)}\right)\) denotes the set of all joint distributions \(\gamma ({x}^{\left(1\right)},{x}^{\left(2\right)})\) whose marginals are respectively \({P}^{\left(1\right)}\) and \({P}^{\left(2\right)}\), and \(p\ge 1\) is an index. The Wasserstein distance is also called the Earth Mover's Distance (EMD). The EMD is the minimum energy cost of moving and transforming a pile of sand in the shape of \({P}^{\left(1\right)}\) into the shape of \({P}^{\left(2\right)}\). The cost is quantified by the amount of sand moved times the moving distance \(d({x}^{\left(1\right)},{x}^{\left(2\right)})\). The EMD is then the cost of the optimal transport plan.

There are some specific cases, very relevant in applications, where WST can be written in an explicit form. Let \({\widehat{P}}^{\left(1\right)}\) and \({\widehat{P}}^{\left(2\right)}\) be the cumulative distribution for one-dimensional distributions \({P}^{\left(1\right)}\) and \({P}^{\left(2\right)}\) on the real line and \({\left({\widehat{P}}^{\left(1\right)}\right)}^{-1}\) and \({\left({\widehat{P}}^{\left(2\right)}\right)}^{-1}\) be their quantile functions.

$$ \begin{array}{*{20}c} {W_{p} \left( {P^{\left( 1 \right)} ,P^{\left( 2 \right)} } \right) = \left( {\mathop \smallint \limits_{0}^{1} \left| {\left( {\hat{P}^{\left( 1 \right)} } \right)^{ - 1} \left( x \right) - \left( {\hat{P}^{\left( 2 \right)} } \right)^{ - 1} \left( x \right)} \right|^{p} dx} \right)^{\frac{1}{p}} } \\ \end{array} $$
(4)

Let’s now consider the case of a discrete distribution \(P\) specified by a set of support points \({x}_{i}\) with \(i=1,\dots ,m\) and their associated probabilities \({w}_{i}\) such that \(\sum_{i=1}^{m}{w}_{i}=1\) with \({w}_{i}\ge 0\) and \({x}_{i}\in M\) for \(i=1,\dots ,m\).

Usually, \(M={\mathbb{R}}^{d}\) is the \(d\)-dimensional Euclidean space with the \({l}_{p}\) norm and \({x}_{i}\) are called the support vectors. \(M\) can also be a symbolic set provided with a symbol-to-symbol similarity. \(P\) can also be written using the notation:

$$ \begin{array}{*{20}c} {P\left( x \right) = \mathop \sum \limits_{i = 1}^{m} w_{i} \delta \left( {x - x_{i} } \right)} \\ \end{array} $$
(5)

where \(\delta (\cdot )\) is the Dirac delta.

The WST distance between two distributions \({P}^{\left(1\right)}=\left\{{w}_{i}^{\left(1\right)},{x}_{i}^{\left(1\right)}\right\}\) with \(i=1,\dots ,{m}_{1}\) and \({P}^{\left(2\right)}=\left\{{w}_{i}^{\left(2\right)},{x}_{i}^{\left(2\right)}\right\}\) with \(i=1,\dots ,{m}_{2}\) is obtained by solving the following linear program:

$$ \begin{array}{*{20}c} {W\left( {P^{\left( 1 \right)} ,P^{\left( 2 \right)} } \right) = \mathop {\min }\limits_{{\gamma_{ij} \in {\mathbb{R}}^{ + } }} \mathop \sum \limits_{{i \in I_{1} ,{ }j \in I_{2} }} \gamma_{ij} { }d\left( {x_{i}^{\left( 1 \right)} ,x_{j}^{\left( 2 \right)} } \right)} \\ \end{array} $$
(6)

The cost of transport between \({x}_{i}^{\left(1\right)}\) and \({x}_{j}^{\left(2\right)}\), \(d\left({x}_{i}^{\left(1\right)},{x}_{j}^{\left(2\right)}\right)\), is defined by the \(p\)-th power of the norm \(\Vert {x}_{i}^{\left(1\right)} - {x}_{j}^{\left(2\right)}\Vert \) (usually the Euclidean distance).

We define two index sets \({I}_{1}=\left\{1,\dots ,{m}_{1}\right\}\) and \({I}_{2}\) likewise, such that:

$$ \begin{array}{*{20}c} {\mathop \sum \limits_{{i \in I_{1} }} \gamma_{ij} = w_{j}^{\left( 2 \right)} , \forall j \in I_{2}} \\ \end{array} $$
(7)
$$ \begin{array}{*{20}c} {\mathop \sum \limits_{{j \in I_{2} }} \gamma_{ij} = w_{i}^{\left( 1 \right)} , \forall i \in I_{1} } \\ \end{array} $$
(8)

Equations (7) and (8) represent the in-flow and out-flow constraint, respectively. The terms \({\gamma }_{ij}\) are called matching weights between support points \({x}_{i}^{\left(1\right)}\) and \({x}_{j}^{\left(2\right)}\) or the optimal coupling for \({P}^{\left(1\right)}\) and \({P}^{\left(2\right)}\).

The discrete version of the WST distance is usually called Earth Mover’s Distance (EMD). For instance, when measuring the distance between grey scale images, the histogram weights are given by the pixel values and the coordinates by the pixel positions. Another way to look at the computation of the EMD is as a network flow problem. In the specific case of histograms, the entries \({\gamma }_{ij}\) denote how much of the bin \(i\) has to be moved to bin \(j\).

The basic computation of OT between two discrete distributions involves solving a network flow problem whose cost typically scales cubically in the size of the measures. There are two lines of work to reduce the time complexity of OT. First, simple ground costs can lead to simpler computations: in the general case the problem can be shown to be equivalent to a min-cost flow problem of quadratic computational complexity and, in specific cases, of linear complexity. The computation of EMD turns out to be the solution of a minimum cost flow problem on a bipartite graph where the bins of \({P}^{(1)}\) are the source nodes, the bins of \({P}^{(2)}\) are the sinks, and the edges between sources and sinks carry the transportation costs. In the case of one-dimensional histograms, the computation of WST reduces to the comparison of two sorted one-dimensional histograms, which can be performed by a simple sorting and the application of the following Eq. (9).

$$ \begin{array}{*{20}c} {W_{p} \left( {P^{\left( 1 \right)} ,P^{\left( 2 \right)} } \right) = \left( {\frac{1}{n}\mathop \sum \limits_{i}^{n} \left| {x_{i}^{{\left( 1 \right){*}}} - x_{i}^{{\left( 2 \right){*}}} } \right|^{p} } \right)^{\frac{1}{p}} } \\ \end{array} $$
(9)

where \(x_{i}^{{\left( 1 \right){*}}}\) and \(x_{i}^{{\left( 2 \right){*}}}\) are the sorted samples. To clarify the relation between the Euclidean and the WST distance, we consider histograms seen as probability vectors of length d, belonging to the probability simplex:

$$ \begin{array}{*{20}c} {\Sigma_{d} = \left\{ {u \in {\mathbb{R}}_{ + }^{d} | \mathop \sum \limits_{i = 1}^{d} u_{i} = 1} \right\}} \\ \end{array} $$
(10)

Assume that we wish to compare images of 10 × 10 = 100 pixels and that these pixels can only take values in a range of 4 possible colours, dark red (dR), light red (lR), dark blue (dB) and light blue (lB): each image can therefore be associated with a histogram over 4 colours. Figure 3 displays three such colour histograms, denoted a, b and c.

The Euclidean distance (the l2 norm of the difference of two vectors) computes the distance between a and b by comparing, for each given index i, their coordinates \({\mathrm{a}}_{\mathrm{i}}\) and \({\mathrm{b}}_{\mathrm{i}}\) one at a time. The Euclidean distance between a and b is 66, between a and c is 69, and between b and c is 77. For the Manhattan distance (the l1 norm of the difference of two vectors) we obtain 120 between a and b as well as between a and c, and 114 between b and c.

Already in Aitchison (1982) and in Le and Cuturi (2015) it was remarked that the information reflected in histograms lies more in the relative value of their coordinates rather than on their absolute value.

This observation matches with the reasonable assumption that dark and light red have more in common than dark red and dark blue supporting the intuition that c should be closer to a than it is to b.

The Wasserstein distance implements this intuition by carrying out an optimization procedure to compute a distance between histograms.

The transport distance is defined as the lowest cost one could possibly find by considering all possible transport plans from a to b. For a, b, and c, we obtain Table 1.

Table 1 Distances between the histograms in Fig. 3
Fig. 3
figure 3

The above picture is taken from Cuturi and Avis (2014)

We can see that the transport distance agrees with our initial intuition that a is closer to c than to b, because it relies on a metric computed on the features (the colours). The Manhattan distance does not discriminate, because it assigns the same value to the pairs (a, b) and (a, c), and the Euclidean distance does not capture the intuition either (66.83 between a and b, 77.24 between b and c, and 72 between a and c). The WST distance is also a full metric, in that it also satisfies the triangle inequality.

3.2 Computational issues

There are some particular cases, very relevant in applications, where WST can be written in an explicit form. Let \(F\) and \(G\) be the cumulative distribution for one-dimensional distributions \(f\) and \(g\) on the real line and \(F^{ - 1}\) and \(G^{ - 1}\) be their quantile functions.

$$ \begin{array}{*{20}c} {W_{p} \left( {f,g} \right) = \left( {\mathop \smallint \limits_{0}^{1} \left| {F^{ - 1} \left( x \right) - G^{ - 1} \left( x \right)} \right|^{p} dx} \right)^{\frac{1}{p}} } \\ \end{array} $$
(11)

Then the computation of WST reduces to the comparison of two 1-dimensional histograms which can be performed by a simple sorting and the application of Eq. (11).

$$ \begin{array}{*{20}c} {W_{p} \left( {f,g} \right) = \left( {\frac{1}{n}\mathop \sum \limits_{i}^{n} \left| {x_{i}^{*} - y_{i}^{*} } \right|^{p} } \right)^{\frac{1}{p}} } \\ \end{array} $$
(12)

where \(x_{i}^{*}\) and \(y_{i}^{*}\) are the sorted samples. In this paper \(p = 1\). In the general discrete case, the output of the optimization problem is both the WST distance and the optimal transport plan \(\gamma_{ij} \in {\mathbb{R}}^{ + }\), where \(M_{ij} \) is the metric cost matrix defining the cost of moving mass from \(x_{i}\) to \(y_{j}\).
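A minimal sketch of Eq. (12) with \(p=1\), cross-checked against the SciPy implementation, is the following; the two samples are synthetic and of equal size, as required by the sorted-sample formula.

```python
# Sketch: 1-D Wasserstein distance via sorted samples (Eq. 12, p = 1) vs scipy.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)
y = rng.normal(0.5, 1.2, 1000)

w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))   # Eq. (12) with p = 1
w1_scipy = wasserstein_distance(x, y)
print(w1_sorted, w1_scipy)                              # the two values coincide
```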

To the term \(W_{p} \left( {f,g} \right)\) we add the entropic regularizer given by the entropy of the matrix \({\Gamma } = \left[ {\gamma_{ij} } \right]\).

$$ \begin{array}{*{20}c} {H\left( {\Gamma } \right) = \mathop \sum \limits_{ij} \gamma_{ij} {\text{log }}\gamma_{ij} } \\ \end{array} $$
(13)
$$ \begin{array}{*{20}c} {\begin{array}{*{20}c} {W_{{}} \left( {f,g} \right) = \mathop {\min }\limits_{{\gamma_{ij} \in {\mathbb{R}}^{{n \times n^{\prime}}} }} \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{{n^{\prime}}} \left( {\gamma_{ij} M_{ij} + \lambda \gamma_{ij} {\text{log }}\gamma_{ij} } \right)} \\ {s.t. \mathop \sum \limits_{j = 1}^{{n^{\prime}}} \gamma_{ij} = f_{i} , \mathop \sum \limits_{i = 1}^{n} \gamma_{ij} = g_{j} , \gamma_{ij} \ge 0} \\ \end{array} } \\ \end{array} $$
(14)

The additional computational cost of MOEA/WST over MOEA is due to the computation of the WST distance.

The formula (11) that we have used to construct \({G}_{w}=({V}_{w},{E}_{w})\) no longer works for \(d>1\). To speed up the search we could consider approximate solutions as suggested in Backurs et al. (2020) and Atasu and Mittelholzer (2019). In this paper we have solved the full optimal transport problem given by Eq. (6) using the library Python Optimal Transport (POT) (Flamary et al. 2021), specifically extending the OT network simplex solver to 3-dimensional histograms.
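The following sketch illustrates this computation with POT on two synthetic 3-dimensional histograms defined over a 10 × 10 × 10 grid (the binning discussed below); the grid, the weights and the Euclidean ground cost are illustrative assumptions.

```python
# Sketch: exact optimal transport (Eqs. 6-8) between two 3-D histograms with POT.
import numpy as np
import ot   # Python Optimal Transport

bins = 10
centres = (np.arange(bins) + 0.5) / bins
grid = np.array(np.meshgrid(centres, centres, centres, indexing='ij')).reshape(3, -1).T  # 1000 support points

rng = np.random.default_rng(0)
a = rng.random(bins ** 3); a /= a.sum()       # weights of P^(1)
b = rng.random(bins ** 3); b /= b.sum()       # weights of P^(2)

M = ot.dist(grid, grid, metric='euclidean')   # ground cost d(x_i, x_j)
print(ot.emd2(a, b, M))                       # WST distance from the network simplex solver
```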

Here we consider the computational cost of OT. Assuming that each objective function is quantized in 10 bins, we have a 3D “image” of 1000 bins. The transport matrix \({\upgamma }_{\mathrm{ij}}\in\Gamma \) has \(10^{6}\) entries, and the corresponding linear program has a worst-case complexity, using interior point methods, which is roughly cubic (up to logarithmic factors) in the number of bins. This can be reduced in several specific cases, notably for univariate distributions using Eq. (11), as is clearly reflected in the wall-clock times reported in Sect. 8. Another solution, offered in the POT package, is the Sinkhorn regularization.

Let us define \({K}_{ij}={e}^{-\frac{{d}_{ij}}{\lambda }}\), where \({d}_{ij}\) represents the cost of moving from \({x}_{i}\) to \({y}_{j}\) and \(\lambda \) is the multiplicative coefficient of the entropic regularizer \(H\left(\Gamma \right)\) in Eq. (14). Then it can be shown that the solution \({\gamma }_{ij}^{*}\) can be expressed as \({u}_{i}{K}_{ij}{v}_{j}\), where \(u\) and \(v\) are unknown vectors. This result means that, instead of optimizing over the \(n\times {n}^{\prime}\) values of the full matrix \({[\gamma }_{ij}]\), we have to optimize only over the \(n+{n}^{\prime}\) values of \(u\) and \(v\) (i.e., Eqs. (15) and (16) would involve 2000 variables instead of \(10^{6}\)).

$$ \begin{array}{*{20}c} {\mathop \sum \limits_{j} T_{ij} = s_{i} \Rightarrow u_{i} \mathop \sum \limits_{j} K_{ij} v_{j} = s_{i} \Rightarrow u_{i} = \frac{{s_{i} }}{{\mathop \sum \nolimits_{j} K_{ij} v_{j} }}} \\ \end{array} $$
(15)
$$ \begin{array}{*{20}c} {\mathop \sum \limits_{i} T_{ij} = d_{j} \Rightarrow v_{j} \mathop \sum \limits_{i} K_{ij} u_{i} = d_{j} \Rightarrow v_{j} = \frac{{d_{j} }}{{\mathop \sum \nolimits_{i} K_{ij} u_{i} }}} \\ \end{array} $$
(16)

By alternating the estimates of \(u\) and \(v\), the algorithm converges to the correct values. This result is the basis of the most effective approximate algorithms, which are analyzed in Peyré and Cuturi (2019).

Starting from the consideration that the weights \(w\) are more important than the matching weights \({\gamma }_{ij}\), approximate solvers, specifically Sinkhorn solvers, have been proposed. Here it is just important to remark that they allow managing the trade-off between accuracy and computational cost through a regularization hyperparameter. Entropic regularization enables scalable computations, but large values of the regularization parameter \(\lambda \) in Eq. (14) could induce an undesirable smoothing effect, while low values not only reduce the scalability but might also induce numerical instabilities.
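A minimal sketch of the alternating updates of Eqs. (15) and (16) is the following; the marginals, the cost matrix and the value of the regularization parameter are placeholders, and, as remarked above, too small a value of the parameter makes the kernel \(K\) numerically unstable.

```python
# Sketch: Sinkhorn iterations implementing Eqs. (15)-(16).
import numpy as np

def sinkhorn(s, d, M, lam=0.1, n_iter=200):
    K = np.exp(-M / lam)                 # K_ij = exp(-d_ij / lambda)
    u = np.ones_like(s)
    v = np.ones_like(d)
    for _ in range(n_iter):
        u = s / (K @ v)                  # Eq. (15): match the first marginal
        v = d / (K.T @ u)                # Eq. (16): match the second marginal
    gamma = u[:, None] * K * v[None, :]  # regularized transport plan
    return np.sum(gamma * M)             # approximate transport cost

# e.g. on the synthetic histograms of the previous sketch: sinkhorn(a, b, M)
```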

4 Graph representation of the rating matrix

The graph representation of the rating matrix is very useful both for its natural interpretation and as an enabler of many graph analysis tools.

4.1 Cosine graph

The first graph representation of the rating matrix derives directly from the \(k\)-NN method. Two users/items (vertices) \(i\) and \(j\) are connected by an edge if their distance (given by a similarity measure like cosine or Pearson) is among the \(k\) smallest distances from \(i\) to the other users/items.

A more general representation is given by the cosine graph. Figure 4 shows the distribution of cosine similarity between all pairs of users.

Fig. 4
figure 4

Distribution of the cosine similarity between all pairs of nodes

The cosine similarity between individual users can be used to build a graph \({G}_{c}=({V}_{c},{E}_{c})\), in which two nodes (users) are linked together if their similarity is above a given threshold (the dotted red line in Fig. 4); then \({V}_{c}={\left\{{u}_{i}\right\}}_{i=1,\dots ,M}\) is the set of users and \({E}_{c}=\left\{\left(i,j\right):\mathrm{cos}\left({u}_{i},{u}_{j}\right)>\tau \right\}\) is the set of edges. Each edge of this graph is then weighted based on the similarity \(\mathrm{cos}\left({u}_{i},{u}_{j}\right)\). Figure 5 displays the resulting graph, in which the edge colour represents its weight. The nodes connected by a red edge are the most similar, while the ones connected by a green edge are the most different.

Fig. 5
figure 5

Cosine graph
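A minimal sketch of the construction of the cosine graph \({G}_{c}\) is the following; the rating matrix and the threshold \(\tau \) are placeholders, and missing ratings are assumed to be stored as zeros.

```python
# Sketch: thresholded cosine graph of the users (placeholder rating matrix).
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(100, 300))          # users x items, 0 = missing rating

S = cosine_similarity(R)                          # user-user cosine similarity
tau = 0.8                                         # illustrative threshold
G_c = nx.Graph()
G_c.add_nodes_from(range(R.shape[0]))
for i in range(R.shape[0]):
    for j in range(i + 1, R.shape[0]):
        if S[i, j] > tau:
            G_c.add_edge(i, j, weight=S[i, j])    # edge weighted by the similarity
```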

4.2 Users as histograms, the feature graph and its clustering

Another graph representation is through the association to each user \({u}_{i}\) of a one-dimensional histogram \(h({u}_{i})\): the bins are the equi-subdivisions of the interval \([\mathrm{0,1}]\) for cosine similarity (and \([-\mathrm{1,1}]\) for Pearson correlation) and the weights are the fraction of users whose cosine similarity falls in each bin; the same representation can be item driven. In Fig. 6 three users’ histograms are shown.

Fig. 6
figure 6

Cosine similarity histograms with a bin length of 0.025

According to this representation, each user is described by a signature, i.e., a feature vector, given by the bins and the associated weights. The elements of this feature space are probability distributions, whose distance is measured by the WST distance. Figure 7 shows the distribution of the Wasserstein distance over all the pairs of nodes.

This distributional representation of the users enables the construction of another graph (Fig. 8) in which the nodes are the users, linked if their distributions of similarity are close enough, i.e., if their Wasserstein distance is below a given threshold (the red dotted line in Fig. 7). Let \({G}_{w}=({V}_{w},{E}_{w})\) be the Wasserstein graph, then \({V}_{w}={\left\{h({u}_{i})\right\}}_{i=1,\dots ,M}\) is the set of users represented as histograms and \({E}_{w}=\left\{\left(i,j\right):WST\left(h\left({u}_{i}\right),h\left({u}_{j}\right)\right)<\tau \right\}\) is the set of edges. The edge \((i,j)\) is then weighted based on the Wasserstein distance \(WST\left(h\left({u}_{i}\right),h\left({u}_{j}\right)\right)\) between the similarity distributions of nodes \(i\) and \(j\) (see Fig. 8).

Fig. 7
figure 7

Distribution of the Wasserstein distance between all pairs of nodes. To enable a better visualization, we omitted in the figure pairs of nodes whose WST distance is greater than 0.5

Fig. 8
figure 8

Distributional Wasserstein graph

The same procedure could be used to construct the item-based Wasserstein graph.
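A minimal sketch of the distributional representation and of the WST graph \({G}_{w}\) is the following; it reuses the similarity matrix \(S\) of the previous sketch and assumes a bin length of 0.025 (40 bins over \([0,1]\)) and an illustrative threshold \(\tau_{w}\).

```python
# Sketch: user signatures as similarity histograms and the thresholded WST graph.
import numpy as np
import networkx as nx
from scipy.stats import wasserstein_distance

def user_histograms(S, n_bins=40):
    """One histogram per user of its cosine similarities to all the other users."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    H = np.array([np.histogram(np.delete(S[i], i), bins=edges)[0] for i in range(S.shape[0])])
    return centres, H / H.sum(axis=1, keepdims=True)   # normalize counts to weights

centres, H = user_histograms(S)       # S: user-user cosine similarity matrix
tau_w = 0.05                          # illustrative threshold on the WST distance
G_w = nx.Graph()
G_w.add_nodes_from(range(H.shape[0]))
for i in range(H.shape[0]):
    for j in range(i + 1, H.shape[0]):
        d = wasserstein_distance(centres, centres, H[i], H[j])
        if d < tau_w:
            G_w.add_edge(i, j, weight=d)
```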

5 Objective functions

For the problem of finding the optimal top-L recommendation list, three conflicting objectives are considered, i.e., accuracy, coverage, and novelty. The distributional representation of these three metrics enables the definition of an information space that allows the use of MOEA/WST. In particular, each recommendation list can be represented by a three-dimensional histogram, as shown in Fig. 9.

Fig. 9
figure 9

Representation of the recommendation list

5.1 Rating score

To each recommendation list, it is possible to assign a score that represents how “good” the items recommended to users are. This score is based on the sum of the ratings given by the users to the recommended items and is given by the following equation:

$$ \begin{array}{*{20}c} {score = \frac{1}{M \cdot L}\mathop \sum \limits_{{u_{i} \in U}}^{{}} \mathop \sum \limits_{{o_{j} \in S_{L} \left( {u_{i} } \right)}}^{{}} r\left( {u_{i} ,o_{j} } \right)} \\ \end{array} $$
(17)

where \(r({u}_{i},{o}_{j})\) is the rating given by user \({u}_{i}\) to item \({o}_{j}\). Maximizing this score ensures that the recommendation list of each user contains only items to which that user has given a high rating.

This particular definition of score admits a distributional representation. The distribution is given by the values of accuracy of each user, as in Eq. (17). This distribution can be represented by a histogram (Fig. 10) in which the support points \({k}_{1},\dots ,{k}_{{N}_{a}}\) correspond to accuracy values, and the weights \({w}_{{k}_{i}}\) with \(i=1,\dots ,{N}_{a}\) represent the fraction of users with a certain value of score: \({w}_{{k}_{i}}=\frac{1}{M}\left|\left\{{u}_{i}: \sum_{{o}_{j}\in {S}_{L}({u}_{i})}r({u}_{i},{o}_{j})\in \left[{k}_{i},{k}_{i+1}\right)\right\}\right|\).

Fig. 10
figure 10

Distributional representation of accuracy

One problem with Collaborative Filtering recommendation is the “popularity bias” (Abdollahpouri et al. 2019): popular items are recommended too frequently while most of the items do not get attention. This is one reason why, in recent research papers, the score is considered together with two other objectives, coverage and novelty.

5.2 Coverage

A recommender system is expected to provide \(M\) recommendation lists. Each list corresponds to a user and consists of \(L\) items. The coverage of the recommendation list is defined as the number of different items in all users’ top-\(L\) lists.

$$ \begin{array}{*{20}c} {{\text{coverage}} = \frac{1}{N}\left| {\bigcup\limits_{{u_{i} \in U}} {S_{L} \left( {u_{i} } \right)} } \right|} \\ \end{array} $$
(18)

The objective function coverage is normalized by the number of items \(N\). Coverage reflects the diversity of the recommendations: a larger value of coverage is better because more choices are provided to the users.

Also the coverage admits a distributional representation. The distribution is given by the ratio between the non-duplicated items in the recommendation list and the total number of items for each user, i.e., the coverage of the user recommendation list \({S}_{L}\left(u\right)\).

This distribution can be represented by a histogram (Fig. 11) in which the support points are the values of coverage \({k}_{1},\dots ,{k}_{{N}_{c}}\), and the weights \({w}_{{k}_{i}}\) with \(i=1,\dots ,{N}_{c}\) represent the fraction of users with a certain value of coverage \({w}_{{k}_{i}}=\frac{1}{M}\left|\left\{{u}_{i}: \left|{S}_{L}\left({u}_{i}\right)\right|\in \left[{k}_{i},{k}_{i+1}\right)\right\}\right|\).

Fig. 11
figure 11

Histogram representation of coverage

5.3 Novelty

This objective is based on the degree \({d}_{j}\) of an item \({o}_{j}\), that is, the number of times it has been rated. The self-information (Zhou et al. 2010) of the item \(o_{j}\) is then given by:

$$ \begin{array}{*{20}c} {N_{j} = \log_{2} \frac{M}{{d_{j} }}} \\ \end{array} $$
(19)

The novelty is then defined as the average self-information of all the items in the recommendation lists of each user:

$$ \begin{array}{*{20}c} {Novelty = \frac{1}{M}\mathop \sum \limits_{{u_{i} \in U}}^{{}} \mathop \sum \limits_{{j \in S_{L} \left( {u_{i} } \right)}}^{{}} \frac{{N_{j} }}{L}} \\ \end{array} $$
(20)

It's important to note that novelty also admits a distributional representation. The distribution is given by the values \(\sum_{i\in {S}_{u}}\frac{{N}_{i}}{L}\) for each user \(u\).

This distribution can be represented by a histogram (Fig. 12) in which the support points \({k}_{1},\dots ,{k}_{{N}_{n}}\) are the values of novelty, and the weights \({w}_{{k}_{i}}\) with \(i=1,\dots ,{N}_{n}\) represent the fraction of users with a certain value of novelty: \({w}_{{k}_{i}}=\frac{1}{M}\left|\left\{{u}_{i}: \sum_{j\in {S}_{L}({u}_{i})}\frac{{N}_{j}}{L}\in \left[{k}_{i},{k}_{i+1}\right)\right\}\right|\).

Fig. 12
figure 12

Distributional representation of novelty

Equations (19) and (20) can be interpreted as relating the novelty to the number of rarely rated items that appear in the users' recommendation lists.
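A minimal sketch of the per-user quantities underlying Eqs. (17)-(20) is the following; the candidate lists and the convention that missing ratings are stored as zeros in \(R\) are illustrative assumptions.

```python
# Sketch: per-user rating score, coverage and novelty of a candidate top-L list.
import numpy as np

def per_user_objectives(R, lists, L):
    """R: users x items rating matrix; lists[u]: the L item indices recommended to user u."""
    M, N = R.shape
    degrees = np.count_nonzero(R, axis=0)              # d_j: number of ratings of item j
    self_info = np.log2(M / np.maximum(degrees, 1))    # N_j = log2(M / d_j), Eq. (19)
    scores, coverages, novelties = [], [], []
    for u, items in enumerate(lists):
        items = np.asarray(items)
        scores.append(R[u, items].sum() / L)           # per-user term of Eq. (17)
        coverages.append(len(set(items.tolist())) / N) # per-user coverage of the list
        novelties.append(self_info[items].sum() / L)   # per-user term of Eq. (20)
    return np.array(scores), np.array(coverages), np.array(novelties)
```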

5.4 Encoding of solutions and their representation

A solution is represented as a matrix whose rows are the users in the cluster and whose columns contain, for each user, the top-L recommended items (see Fig. 13).

Fig. 13
figure 13

Matrix encoding of a top-L recommendation list

The multi-objective problem is

$$\underset{\mathrm{x}\in\Omega \subseteq {\mathbb{R}}^{\mathrm{d}}}{\mathrm{max}}F\left(x\right)=\left({f}_{1}\left(x\right),{f}_{2}(x),{f}_{3}\left(x\right)\right)$$

where \(x\) is given by the entries of the matrix shown before. The codomain of each objective function \({f}_{i}\left(x\right)\) is subdivided in \({n}_{i}\) bins.

The distributional representation of the three previously defined objectives can be viewed as a three-dimensional histogram. For each recommendation list \({S}_{L}\) the support points of this histogram are the values of accuracy along the x-axis, the values of coverage along the y-axis and the values of novelty along the z-axis; the weights represent the fraction of users whose values of accuracy, coverage and novelty fall in a specific range. These distributions constitute the space on which the MOEA/WST algorithm is based. Therefore, it uses the Wasserstein distance computed by Eqs. (6)–(8) to compare the histograms associated with different top-\(L\) recommendation lists. In our problem \({f}_{1}\) is the accuracy given by Eq. (17), \({f}_{2}\) is the coverage given by Eq. (18) and \({f}_{3}\) is the novelty given by Eq. (20).
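A minimal sketch of this three-dimensional histogram is the following; the per-objective bin counts and ranges are illustrative assumptions.

```python
# Sketch: joint 3-D histogram of the per-user (score, coverage, novelty) values of a list.
import numpy as np

def list_histogram(scores, coverages, novelties,
                   bins=(10, 10, 10), ranges=((0, 5), (0, 1), (0, 15))):
    sample = np.stack([scores, coverages, novelties], axis=1)   # one row per user
    H, _ = np.histogramdd(sample, bins=bins, range=ranges)
    return H / H.sum()     # normalized weights, one per bin of the 3-D grid
```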

6 The description of MOEA/WST

The multi-objective evolutionary algorithm used for the solution of problem (2) is based on the Python framework Pymoo (Blank and Deb 2020). This section is focused on analyzing how the mathematics presented in the previous sections enables the construction of a new genetic operator and how it can be embedded in the general workflow of Pymoo.

The set of non-dominated solutions is called the Pareto set and its image in the objective space the Pareto Front (PF). For a Pareto optimal solution, the value of one objective cannot be improved unless at least one of the other objectives is negatively affected.

In order to approximate the Pareto set, we follow a strategy based on non-dominated sorting genetic algorithms, which is implemented in Pymoo by the algorithm NSGA-II.

The general structure is as follows: the blue modules are basically the same as in Pymoo, while the red module “selection” is a contribution of this paper and is analyzed in detail in Sect. 6.1.

The algorithm begins with a random selection of a number of individuals among which we select the non-dominated individuals as elements of first approximation of the Pareto set. Then MOEA/WST performs the following steps: (i) selection, (ii) crossover and (iii) mutation.

Pymoo offers different solutions for the above steps. For the computations, Simulated Binary Crossover with \(\eta =3\) and Inversion Mutation have been used. The element of novelty of the proposed algorithm with respect to NSGA-II is the selection operator based on the Wasserstein distance, analyzed in Sect. 6.1; the new algorithm is accordingly called MOEA/WST. To evaluate the performance of MOEAs, the hypervolume metric is generally adopted, given by the volume of the portion of the objective space enclosed between a reference point and the approximate Pareto front.

6.1 Selection

In order to select the pairs of parents to be mated using the crossover operation, we have introduced a problem-specific selection method that takes place in the Wasserstein space: we use the Wasserstein distance between the histograms corresponding to the candidate recommendation lists \({F}_{i}\) and \({M}_{i}\).

First, we randomly sample two pairs of individuals \(({F}_{1},{M}_{1})\) and \(({F}_{2},{M}_{2})\) from the current approximation of the Pareto set. Then we choose the pair \(\left({F}_{i},{M}_{i}\right)\) as the parents of the new offspring, where \(i=\mathrm{arg}\underset{\mathrm{i}\in \left\{\mathrm{1,2}\right\}}{\mathrm{max}}WST({F}_{i},{M}_{i})\). This favours exploration and diversification in the Wasserstein space (see Fig. 14).

Fig. 14
figure 14

The Wasserstein-based selection operator
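A minimal sketch of this selection operator is the following; `hist_of` (the map from a candidate list to its 3-D histogram), the ground-cost matrix between bin centres and the use of POT's exact solver follow the earlier sketches and are illustrative assumptions rather than the exact implementation.

```python
# Sketch: Wasserstein-based selection of the parent pair to be mated.
import numpy as np
import ot

def wst(h1, h2, cost):
    """Exact WST distance between two (flattened, normalized) histograms."""
    return ot.emd2(h1.ravel() / h1.sum(), h2.ravel() / h2.sum(), cost)

def wst_selection(population, hist_of, cost, rng=np.random.default_rng()):
    """population: candidate top-L lists; returns the pair whose members are farthest apart."""
    pairs = []
    for _ in range(2):
        i, j = rng.choice(len(population), size=2, replace=False)
        pairs.append((population[i], population[j]))
    distances = [wst(hist_of(f), hist_of(m), cost) for f, m in pairs]
    return pairs[int(np.argmax(distances))]   # favour the more distant pair (exploration)
```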

7 Computational results

In this section, the computational results over the MovieLens dataset are reported. First, the two algorithms, NSGA-II and MOEA/WST, have been tested on the clusters resulting from the cosine graph of Fig. 5. The optimization has been run for each cluster. Figure 15 shows the hypervolume over generations of NSGA-II (red) and MOEA/WST (blue). Since multiple runs of the algorithms are performed, the charts display the mean and standard deviation of the target metric.

Fig. 15
figure 15

Hypervolume over generations of the three different clusters identified by clustering the cosine graph

The hypervolume curve of MOEA/WST is better than that of NSGA-II. Also in terms of the C-metric, defined in Sect. 2 and shown in Fig. 16, MOEA/WST is the better performer. Then the two algorithms have been used, with the same settings, on the clusters of the Wasserstein graph of Fig. 8. The results for hypervolume and C-metric are reported in Figs. 17 and 18, respectively.

Fig. 16
figure 16

Coverage over generations of the three different clusters identified by clustering the cosine graph

Fig. 17
figure 17

Hypervolume over generations of the three different clusters identified by clustering the Wasserstein graph

Fig. 18
figure 18

Coverage over generations of the three different clusters identified by clustering the Wasserstein graph

Again, both the hypervolume curve and the C-metric of MOEA/WST are better than those of NSGA-II. It is also important to note that, using the clusters over the Wasserstein graph, both algorithms perform better than using the cosine graph.

8 Limitations and discussion

The main limitation in the application of the Wasserstein distance is its computational cost for multivariate histograms. The straightforward utilization of Eqs. (6), (7) and (8) can become prohibitive. A first solution is to use approximate solvers such as those quoted in Sect. 3.2. Another solution, offered in Python Optimal Transport (POT), is the Sinkhorn regularization, which drastically reduces the computational cost by introducing a regularization term. Handling the regularization parameter is not easy and requires a delicate balance between accuracy and numerical stability due to ill conditioning. Another limitation is that the cost matrix is usually assumed in a problem-agnostic way: a better solution is to use algorithms that can learn the ground metric using only a training set of labelled histograms, as proposed in Cuturi and Avis (2014) and Heitz et al. (2021).

Given the growing importance of the WST distance in fields like imaging and the generation of adversarial networks, among others, approximate algorithms are being proposed. Beugnot et al. (2021) introduce, before the linear solver, a pre-processing step which summarizes the data into a smaller number of representative samples, therefore solving a smaller linear problem. A different approach has been recently suggested in Si et al. (2020), where the authors provide insights as to why, despite the curse of dimensionality, the WST distance enjoys favourable empirical performance.

The computational overhead related to the computation of the distance is expectedly significant because the support of the histograms is 3-dimensional. As can be gleaned from Fig. 17, the value 0.538 of hypervolume is reached by MOEA/WST in 25 generations versus the 50 generations required by NSGA-II. Each generation requires 10 function evaluations, which brings the advantage of MOEA/WST over NSGA-II to 250 function evaluations. The wall-clock time of these specific computations is 80 s for NSGA-II and 220 s for MOEA/WST, of which 140 s are due to the computation of the Wasserstein distances, which must be computed for each individual. If \(c\) is the time in seconds required for each function evaluation, the balance between the two algorithms is given by \(250c=140.\) This means that for \(c>0.56\) s MOEA/WST is better also in terms of wall-clock time. This computation has been performed for MovieLens 100k. Under the conservative assumption that the value of \(c\) scales linearly with the number of users, MovieLens 1M would give an even larger advantage to MOEA/WST.

9 Conclusions and perspectives

The main conclusion is that embedding both the data model and the optimization algorithm in a probabilistic space is a novel and potentially useful approach to the design of RSs. The elements of this space are histograms which capture, in a low-dimensional space and with a limited number of features, the interactions between users and items. The Wasserstein distance is particularly suitable as a measure of similarity between histograms. The Wasserstein distance also enables a novel graph representation of the rating matrix: the nodes are the users, and the weights of the edges are the WST distances between users.

The second key result is that a candidate top-k list is represented as a multidimensional histogram which encodes the knowledge obtained from the evaluation of all the objective functions. The difference between two lists can be encoded as the WST distance between their histograms. This enables the definition of a WST-based selection operator in MOEAs.

The resulting algorithm MOEA/WST results in better hypervolume and coverage than NSGA-II. The advantage is larger at low generation counts, meaning that MOEA/WST is particularly useful when the evaluation of the objectives is even moderately expensive. Moreover, comparing histograms through the Wasserstein distance enables a better visualization. This potential has been largely realized for images but has general applications. As we have remarked before, WST between histograms or point clouds, considered as instances of probability measures, is defined as the smallest cost required to move one measure to another. Because the Wasserstein distance is geodesic when the ground metric is geodesic, optimal transport can be used to compute the interpolation between two measures, that is, the shortest path in the probability simplex that connects the two measures as end-points. The interpolation process describes a series of intermediate measures along the transport process. We have not considered in this paper the use of the Fréchet mean, also called the Wasserstein barycenter, of a set of measures. The barycenter is shape-preserving and offers a better synthesis of a set of distributions than the Euclidean average. Barycenters can be seen as centroids in a clustering process which can be structured as the standard k-means using the Wasserstein distance between elements. The Wasserstein distance, thanks to its linear programming basis, has also emerged as a main tool in economic analysis (Galichon 2021), where issues in labour economics, trade and derivatives pricing have been considered.

The main conclusion of this paper is that the Wasserstein distance is a convenient way to describe complex or high-dimensional objects, allowing one to reparametrize them so as to work with a reduced number of features. These results have implications, beyond the domain of RSs, for those situations, as for instance computer experiments or simulation/optimization, in which associating a distribution with the inputs might be more effective than merely comparing a set of parametric features such as the mean or the higher moments. Indeed, the WST distance takes the whole distribution into account and naturally matches the perceptual notions of nearness and similarity.