Modeling search and session effectiveness

https://doi.org/10.1016/j.ipm.2021.102601

Highlights

  • Web search effectiveness can be measured via the utility gained by users.

  • Similarly, search session effectiveness can be measured via average utility gain.

  • Effectiveness measures also describe models that predict user actions.

  • Session measures can be compared by correlating predicted and observed user actions.

  • We present a session effectiveness metric that correlates better than previous ones.

Abstract

Many information needs cannot be resolved with a single query, and instead lead naturally to a sequence of queries, issued as a search session. In a session test collection, each topic has an associated query sequence, with users assumed to follow that sequence when reformulating their queries. Here we propose a session-based offline evaluation framework as an extension to the existing query-based C/W/L framework, and use that framework to devise an adaptive session-based effectiveness metric, as a way of measuring the overall usefulness of a search session. To realize that goal, data from two commercial search engines is employed to model the two required behaviors: the user conditional continuation probability, and the user conditional reformulation probability. We show that the session-extended C/W/L framework allows the development of new metrics with associated user models that give rise to greater correlation with observed user behavior during search sessions than do previous session metrics, and hence provide a richer context in which to compare retrieval systems at a session level.

Introduction

The information retrieval (IR) community has made extensive use of test collection-based evaluation as a complement to (or even replacement for) user-based studies (Sanderson, 2010, Voorhees, 2002). A test collection and evaluation measure(s), together, can be viewed as a simulation of users, as if they are interacting with search result pages in an operational setting (Sanderson, 2010). Many effectiveness metrics have been developed based on such models of user behavior, including rank-biased precision (RBP) (Moffat & Zobel, 2008); expected reciprocal rank (ERR) (Chapelle, Metzler, Zhang, & Grinspan, 2009); time-biased gain (TBG) (Smucker & Clarke, 2012); the bejeweled-player model (BPM) (Zhang et al., 2017); INST (Moffat, Bailey, Scholer, & Thomas, 2017); the information foraging model (IFT) (Azzopardi, Thomas, & Craswell, 2018); and the data-driven model (DDM) (Azzopardi, White, Thomas, & Craswell, 2020). Each of these metrics computes a numeric quality score for a single search engine result page (SERP) returned for a single query posed in connection with a single information need. Whatever metric is used, the scores are then typically aggregated across a set of topics, and, if two or more systems are being compared, a paired statistical test is applied.

More realistically, a user with an information need will typically submit an initial query, and then consider the SERP that is returned. If they fail to find a sufficient number of relevant documents, or feel that their information need has not been adequately resolved for some other reason, they will reformulate their query and examine a second SERP. Such iteration may continue through several – perhaps even many – cycles of refinement/alteration before the search session is concluded. Jansen and Spink (2006) studied a sample of Excite.com and AltaVista.com search query logs collected in 2001 and 2002 respectively, and found that 55% of Excite.com users reformulated their initial queries, and that 47% of AltaVista.com queries were submitted in the context of similar queries. Jansen, Booth, and Spink (2009) define six types of query reformulation, the most important of which are specialization and generalization. Specialization occurs when the reformulated query contains additional terms, in order to seek more precise information; while generalization arises if the reformulated query contains fewer terms than the previous one. A search session might include both types of reformulation.

There are several underlying reasons for reformulation. Smith and Kantor (2008) and Turpin and Hersh (2001) demonstrate that users are able to compensate for the reduced effectiveness of search engine systems by adapting their behavior, and one such adaptation is submitting more queries. Järvelin, Price, Delcambre, and Nielsen (2008) describe a laboratory-based interactive searching study, finding that in some cases initial queries give poor results because users submit query terms that do not accurately cover the topic description, and that reformulation is part of a learning experience. Other experimentation suggests that users tend to pose several short queries in a session, rather than one comprehensive one (Järvelin, 2009, Keskustalo et al., 2009); and that user engagement is also positively correlated with search success (O’Brien, Arguello, & Capra, 2020).

Given that users carry out their searching activities via sessions, a range of proposals for session-based effectiveness evaluation have emerged (Ferrante and Ferro, 2020, Järvelin et al., 2008, Kanoulas et al., 2010, Kanoulas et al., 2011, Lipani et al., 2019, van Dijk et al., 2019, Yang and Lad, 2009). Such mechanisms require session-based test collections, such as the one used in the TREC 2010 Session Track (Kanoulas et al., 2010, Kanoulas et al., 2011). As with query-oriented test collections, session-based ones consist of three components: a collection of documents; a set of topics; and a set of relevance judgments. In a query-oriented test collection we think of each topic as having one query associated with it (or one query per user per topic; see, for example, Bailey, Moffat, Scholer, & Thomas, 2017). In a session-oriented test collection each topic is associated with a sequence of queries, with simulated users assumed to follow that fixed ordering when reformulating: they decide only whether to continue from one query to the next, and have no control over the ordering itself (Järvelin et al., 2008, Kanoulas et al., 2010, Kanoulas et al., 2011, Lipani et al., 2019). Fig. 1 illustrates this idea. Note that the number of queries actually posed by each simulated user is not known, and that they might become satisfied (or disillusioned) at different points in the sequence.

In the framework shown in Fig. 1 the score for topic t is a probability-weighted summation over the individual query (that is, SERP) scores for the static query sequence, with the weight associated with query Q_{t,j} (the t-th topic's j-th query) derived from the proportion of users that would submit that query. As an aid to understanding this model, note that most effectiveness metrics for single SERPs similarly do not assume specific knowledge of what behaviors an individual user may have performed (for example, which captions or documents got inspected), and assume only that each user interacts with each ranking according to an overall pattern associated with the population of users. Similarly, session evaluation models do not assume knowledge of which queries were issued by each user, but do assume that the population of users has certain gross characteristics.
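Concretely, a minimal sketch of that aggregation, in illustrative notation (M(·) stands for any per-SERP effectiveness metric; the weighting scheme actually developed later in the paper may differ), is:

```latex
% Illustrative only: w_{t,j} is the assumed proportion of users who submit
% the j-th query for topic t, and M(Q_{t,j}) is the score of that SERP.
S_t \;=\; \sum_{j} w_{t,j}\, M(Q_{t,j}),
\qquad
w_{t,j} \propto \Pr\bigl[\text{a user issues } Q_{t,j}\bigr].
```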

It is those overall population characteristics that we study in this paper. Table 1 provides further explanation, comparing the inputs, the parameter estimation (fitting), and the modeling (predicting) components of single query search (center column) and multi-query session-based search (right column). When evaluating a single SERP, the main inputs are the ranking of documents and a set of relevance judgments (qrels), and in most effectiveness metrics the user interaction with that SERP is modeled rather than observationally recorded. That is, the SERP effectiveness score in the center column is determined from the modeled interactions, rather than from a set of actual individual interactions. A similar arrangement arises in the right column in Table 1. Not only is the per-SERP behavior of each individual user modeled, rather than observed, for the purpose of the evaluation; but also the number of queries issued by that individual is modeled, rather than observed.

To connect observation with predictive model, the middle row of Table 1 notes that the model parameters can be fitted to, and model structure informed by, detailed observations of users. That intermediate “observational fitting” goal of session evaluation has been considered by a range of authors (Jiang and Allan, 2016, Liu et al., 2018, Liu et al., 2019, Wicaksono and Moffat, 2020, Zhang et al., 2020). Given the sequence of queries submitted by each individual user in a session, the challenge is to aggregate the individual query scores via a weighting scheme tuned to optimize a certain quantity, such as user-reported satisfaction (Jiang and Allan, 2016, Liu et al., 2018, Liu et al., 2019, Wicaksono and Moffat, 2020, Zhang et al., 2020). When addressing this observational goal, there is no probability-weighted sum and no reformulation probabilities, since what queries the user posed and how they reformulated them are both known. For example, Zhang et al. (2020) consider the notion of forgetfulness, with users tending to increasingly discount the utility gathered from earlier queries; that mechanism relies on knowledge of which query in the sequence was the user’s last one. Similarly, in a single-SERP observational evaluation (the center cell in the center column in Table 1) it is possible to connect user-reported satisfaction with knowledge of the particular documents viewed (and hence their utilities), including the order in which they were encountered, and which one was viewed last.

Our goal in this paper is to develop an adaptive user model for session evaluation, capturing overall population behaviors rather than individual user actions. In particular, we extend the work of Moffat et al. (2017) and Moffat, Thomas, and Scholer (2013), who connect metrics, user models, and user behaviors. We start by collecting and reporting empirical evidence that adds further support to the observations of Moffat et al., 2017, Moffat et al., 2013 regarding query-level user behaviors, using three large search interaction logs. We then develop a model for session-level behaviors (that is, query reformulation behaviors), and incorporate that extension into session-based search effectiveness measures, capitalizing on the framework described in Fig. 2. In that framework, each user commences their search session by posing a first query, j=1, and then sequentially examines the ranked list of items from its top position, i=1. At each rank position i of the j-th SERP a decision is made: to continue to rank i+1; or to exit. In the latter case, a further choice is made: to issue a reformulated (j+1)-th query; or to end the session entirely.

As a result there are two important functions to be estimated:

  • 1.

    The continuation function, C(j,i), describing the conditional probability of the user inspecting the item at rank i+1 in the j-th SERP, given that they have just examined the item at rank i; and

  • 2.

    The reformulation function, F(j), describing the conditional probability of the user issuing a (j+1)-th query, given that they have just ended their inspection of the j-th SERP.

The first of these two functions describes the user’s query-level behavior; the second their session-level behavior. A wide range of session-based metrics can then be characterized via these two functions. The challenge is thus to find models for C(j,i) and F(j) that accurately predict observed behavior – the goal expressed in the bottom right element of Table 1.
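As a concrete (if simplified) illustration of how these two functions jointly determine simulated behavior, the following sketch draws one browsing path through a session. The forms of C and F supplied here are arbitrary placeholders, not the fitted models developed later in the paper.

```python
import random

def simulate_session(C, F, max_queries=10, max_depth=50, rng=random.random):
    """Draw one simulated browsing path through a search session.

    C(j, i): conditional probability of moving from rank i to rank i+1
             in the j-th SERP.
    F(j):    conditional probability of issuing a (j+1)-th query after
             leaving the j-th SERP.
    Returns a list of (query index, deepest rank inspected) pairs.
    """
    path = []
    j = 1
    while j <= max_queries:
        i = 1                                # rank 1 of each SERP is always inspected
        while i < max_depth and rng() < C(j, i):
            i += 1                           # continue to the next rank
        path.append((j, i))
        if rng() >= F(j):                    # otherwise, end the session
            break
        j += 1
    return path

# Illustrative placeholder behaviors: RBP-like continuation with p = 0.8,
# and a geometrically decaying willingness to reformulate.
print(simulate_session(C=lambda j, i: 0.8, F=lambda j: 0.6 * (0.9 ** (j - 1))))
```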

In the context of that background, we consider the following research questions:

  • Can the single-query C/W/L framework be extended to query sessions (Section 4)?

  • Can observations in regard to user behavior such as impressions and click-throughs be used to infer values for the corresponding parameters of C/W/L models for single- and multi-query sessions; and if so, how well do the fitted models anticipate those observations, compared with alternative models (Section 5)?

  • Can a C/W/L-based session evaluation metric be derived that correlates with measured user session satisfaction data, and if so, what practical implications does that definition hold for system evaluation (Section 6)?

Section snippets

Query-based C/W/L framework

One possible goal of effectiveness measurement is for the scores generated by the metric to correspond to the expected benefit (or utility) derived from the SERP by a user. In the C/W/L framework (Moffat et al., 2017, Moffat et al., 2013, Moffat and Zobel, 2008) a user model is characterized by any one of three interconnected functions: (1) the conditional continuation probability that the user will continue from the document at rank i to the one at rank i+1, denoted by C(i); (2) the weight
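The three functions of the query-level framework are interchangeable descriptions of the same user model. The sketch below is our own illustrative code (not the authors' implementation), showing the standard relationships: viewing weights derived from the continuation probabilities, the induced last-rank distribution, and an expected rate of gain.

```python
def cwl_scores(C, gains):
    """Query-level C/W/L quantities for one SERP (an illustrative sketch).

    C(i):  conditional continuation probability from rank i to rank i+1.
    gains: per-document gains r_1, r_2, ..., for the ranking being scored.
    Returns the attention weights W(i), the last-rank distribution L(i),
    and the expected rate of gain, sum_i W(i) * r_i.
    """
    d = len(gains)
    # V(i) = probability of viewing rank i: V(1) = 1, V(i) = prod_{k<i} C(k).
    V = [1.0]
    for i in range(1, d):
        V.append(V[-1] * C(i))
    total = sum(V)
    W = [v / total for v in V]                        # normalized viewing weights
    L = [V[i] * (1.0 - C(i + 1)) for i in range(d)]   # P(rank i+1 is the last viewed)
    erg = sum(w * r for w, r in zip(W, gains))
    return W, L, erg

# With constant C(i) = p this is the RBP user model: in the limit of large
# depth, W(i) approaches (1 - p) * p^(i - 1).
W, L, erg = cwl_scores(C=lambda i: 0.8, gains=[1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
```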

Interaction logs

This section describes a total of six resources: three from commercial search providers, and three from laboratory studies.

A session-based C/W/L framework

To obtain a session-based framework we add a second dimension to the query-level C/W/L definitions, as was anticipated by Fig. 2. Fig. 4 explains this idea, unrolling the loops in Fig. 2, and showing the browsing paths a user might follow through a search session. It is assumed as a first base case that the user starts the session by submitting a first query Q1, and then inspecting the first item in the corresponding SERP, to accumulate the gain r1,1. It is also assumed, as a second base case,
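To make those unrolled browsing paths concrete, one simple quantity computable from the two base cases and the two conditional probabilities is an (unnormalized) expected total gain over the session. The sketch below is illustrative only; the session metric developed later may normalize or weight gain differently.

```python
def session_expected_total_gain(C, F, gains):
    """Unnormalized expected total gain for one topic's fixed query sequence.

    gains[j-1][i-1] is r_{j,i}, the gain at rank i of the j-th SERP; C(j, i)
    and F(j) are the continuation and reformulation probabilities described
    above.  This shows how gain accumulates along the unrolled paths; it is
    not the specific aggregation proposed in the paper.
    """
    total, p_reach_serp = 0.0, 1.0          # base case: the first query is always issued
    for j, serp in enumerate(gains, start=1):
        p_view = 1.0                        # base case: rank 1 of each SERP is inspected
        for i, r in enumerate(serp, start=1):
            total += p_reach_serp * p_view * r
            p_view *= C(j, i)               # probability of continuing to rank i+1
        p_reach_serp *= F(j)                # probability of issuing the (j+1)-th query
    return total

# Two-query example with placeholder behaviors and binary gains.
print(session_expected_total_gain(C=lambda j, i: 0.7, F=lambda j: 0.5,
                                  gains=[[1, 0, 1], [0, 1, 1]]))
```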

Search behaviors

The previous section described the session-based C/W/L structure and its two key factors: the conditional continuation probability (determining query-level behavior), and the conditional reformulation probability (governing session-level behavior). It also showed that two existing session evaluation metrics could be described within that framework, hence establishing the generality of the proposed approach.

In this section we employ interaction logs from two commercial search engines, Seek.com.au

A model-based session metric

Section 5 established a number of behaviors that can be associated with C(i) and F(j) in the extended C/W/L framework shown in Fig. 2 (see Table 9 and Fig. 9). We now crystallize those observed trends into a specific model for session evaluation, and measure its fit against observation, compared with two previous session-based evaluation proposals. While the model might still employ parameters such as the initial session target T0, it should not depend on additional information other

Context and conclusions

We have extended the C/W/L framework to session-based effectiveness evaluation, and demonstrated that existing session-based user models can be explained by this generalized evaluation framework. In the session-based C/W/L approach a user model (describing a universe of users) is characterized by two behaviors: their conditional continuation probability at rank i when examining the j-th SERP, C(j,i); and their conditional reformulation probability, F(j).

Three commercial search interaction logs

Acknowledgments

This work was in part supported by the Australian Research Council (grant LP150100252, held in collaboration with Seek.com; and grant DP190101113). We gratefully acknowledge the assistance of Bahar Salehi and Justin Zobel (The University of Melbourne); Damiano Spina (RMIT University); and Sargol Sadeghi and Vincent Li (Seek). We also thank the anonymous referees, who provided helpful comments that improved the paper.

References (48)

  • Jansen, B. J., & Spink, A. (2006). How are we searching the world wide web? A comparison of nine search engine transaction logs. Information Processing & Management.
  • O’Brien, H. L., Arguello, J., & Capra, R. (2020). An empirical study of interest, task complexity, and search behaviour on user engagement. Information Processing & Management.
  • Azzopardi, L., Thomas, P., & Craswell, N. (2018). Measuring the utility of search engine result pages. In Proc. Ann....
  • Azzopardi, L., White, R. W., Thomas, P., & Craswell, N. (2020). Data-driven evaluation metrics for heterogeneous search...
  • Bailey, P., Moffat, A., Scholer, F., & Thomas, P. (2017). Retrieval consistency in the presence of query variations. In...
  • Carterette, B., & Jones, R. (2007). Evaluating search engines by modeling the relationship between relevance and...
  • Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proc....
  • Cutrell, E., & Guan, Z. (2007). What are you looking for? An eye-tracking study of information usage in web search. In...
  • Ferrante, M., & Ferro, N. (2020). Exploiting stopping time to evaluate accumulated relevance. In Proc. Int. Conf. on...
  • Jansen, B. J., Booth, D. L., & Spink, A. (2009). Patterns of query reformulation during web searching. Journal of the American Society for Information Science and Technology.
  • Järvelin, K. (2009). Explaining user performance in information retrieval: Challenges to IR evaluation. In Proc. Int....
  • Järvelin, K., Price, S. L., Delcambre, L. M., & Nielsen, M. L. (2008). Discounted cumulated gain based evaluation of...
  • Jiang, J., & Allan, J. (2016). Correlation between system and user metrics in a session. In Proc. Conf. on Human...
  • Jiang, J., He, D., & Allan, J. (2014). Searching, browsing, and clicking in a search session: Changes in user behavior...
  • Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). Accurately interpreting clickthrough data as...
  • Kanoulas, E., Carterette, B., Clough, P. D., & Sanderson, M. (2010). Overview of the TREC 2010 Session Track. In Proc....
  • Kanoulas, E., Carterette, B., Clough, P. D., & Sanderson, M. (2011). Evaluating multi-query sessions. In Proc. Ann....
  • Keskustalo, H., Järvelin, K., Pirkola, A., Sharma, T., & Lykke, M. (2009). Test collection-based IR evaluation needs...
  • Li, J., Arya, D., Ha-Thuc, V., & Sinha, S. (2016). How to get them a dream job? Entity-aware features for personalized...
  • Lipani, A., Carterette, B., & Yilmaz, E. (2019). From a user model for query sessions to session rank biased precision...
  • Liu, M., Liu, Y., Mao, J., Luo, C., & Ma, S. (2018). Towards designing better session search evaluation metrics. In...
  • Liu, M., Mao, J., Liu, Y., Zhang, M., & Ma, S. (2019). Investigating cognitive effects in session-level search user...
  • Lu, X., et al. (2016). The effect of pooling and evaluation depth on IR metrics. Information Retrieval.
  • Luo, J., Wing, C., Yang, H., & Hearst, M. (2013). The water filling model and the cube test: Multi-dimensional...