
2023.11.22

Blog

Offline Evaluation of Personalization Algorithms Using Causal Inference


Introduction

I am Takuma Udagawa, a researcher in machine learning and causal inference at Sony Group Corporation. This article introduces some fundamentals on the offline evaluation of decision-making algorithms and the paper we presented at AAAI2023 [1].

Background

Today, service providers make many kinds of decisions about how to act on users in a wide range of scenarios, such as recommendations in video streaming services, personalization in online advertising, and e-mail newsletters. In particular, decision-making policies based on machine learning models trained on logged data collected in the past play an important role in improving key performance indicators (KPIs).

Optimizing decision-making policies is also necessary at Sony, particularly in direct-to-consumer (DTC) services such as games and entertainment content delivery, and we are considering various applications. The personalization of services is now commonplace, and high-quality personalization is key to improving the user experience and connecting with more users [2].

In practice, however, introducing a new decision-making policy into a production system is expensive: incorporating a new policy into a service is time-consuming, and the performance of that policy may turn out to be poor.

What is Off-Policy Evaluation?

Given the above-mentioned background, off-policy evaluation (OPE), a technique that applies causal inference, has become important because it enables us to evaluate the performance of a decision-making policy using logged data (i.e., offline evaluation). The formulation is as follows: We denote the user feature (context) by \(x\in\mathcal{X}\), the action on the user by \(a\in\mathcal{A}\), and the observed reward by \(r\in[0,r_{\rm{max}}]\). Hence, the logged data is \(\mathcal{D}_{b}:=\{(x_i,a_i,r_i)\}_{i=1}^{n}\). The decision-making policy used to collect the logged data is called the behavior policy and denoted by \(\pi_b\) (Figure 1).

The following are examples of features, actions, and rewards. In video streaming services, we can consider the viewing history, age, and country as features, the type of video to be recommended as actions, and the clicks and playback time as rewards. For online advertising, we can consider the browsing history, age, and country as features, the type of advertisement presented as actions, and clicks and conversions as rewards.
Figure 1
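To make the formulation concrete, the following is a minimal, hypothetical sketch in Python (NumPy only) of how logged data \(\mathcal{D}_{b}=\{(x_i,a_i,r_i)\}_{i=1}^{n}\) might be generated by a behavior policy \(\pi_b\). The feature dimensions, the softmax behavior policy, and the Bernoulli rewards are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, dim_context = 10_000, 5, 3

# Contexts x_i: illustrative user features (e.g., standardized age / activity / tenure).
X = rng.normal(size=(n, dim_context))

# Behavior policy pi_b(a|x): softmax over a random linear score (purely for illustration).
theta_b = rng.normal(size=(dim_context, n_actions))
scores = X @ theta_b
pi_b = np.exp(scores - scores.max(axis=1, keepdims=True))
pi_b /= pi_b.sum(axis=1, keepdims=True)

# Actions a_i ~ pi_b(.|x_i); keep the logged propensities pi_b(a_i|x_i) for IPS/DR later.
actions = np.array([rng.choice(n_actions, p=p) for p in pi_b])
pscore = pi_b[np.arange(n), actions]

# Rewards r_i in {0, 1}: Bernoulli with a context- and action-dependent mean q(x, a).
theta_q = rng.normal(size=(dim_context, n_actions))
q = 1.0 / (1.0 + np.exp(-(X @ theta_q)))          # q(x, a) for every action
rewards = rng.binomial(1, q[np.arange(n), actions])

# Logged data D_b = {(x_i, a_i, r_i)} plus the logged propensities.
data = {"context": X, "action": actions, "reward": rewards, "pscore": pscore}
```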

We want to know the performance of a new policy \(\pi_e\) (evaluation policy), which is different from the behavior policy \(\pi_b\), and its performance is defined as the expected reward \(V(\pi_{e})\) when it is deployed online (Figure 2). The goal of OPE is to estimate the performance of evaluation policy \(\pi_e\) accurately using the logged data \(\mathcal{D}_b\) offline.

Figure 2: The performance of an evaluation policy is defined as the expected reward obtained when that policy is actually deployed online. The goal of OPE is to estimate this performance accurately offline using the logged data.

Many estimation methods (off-policy estimators \(\hat{V}(\pi_{e};\mathcal{D}_b)\)) have been proposed. Typical examples are inverse propensity scoring (IPS, also called inverse probability weighting (IPW) in some papers) [3,4], the direct method (DM) [5], and the doubly robust (DR) estimator [6]:

\(\hat{V}_{\rm{IPS}}(\pi_{e};\mathcal{D}_{b}):=\frac{1}{n}\sum_{i=1}^{n}w(x_i,a_i)r_i\)  where \(w(x,a):=\frac{\pi_e(a|x)}{\pi_b(a|x)}\) (importance weight),

\(\hat{V}_{\rm{DM}}(\pi_{e};\mathcal{D}_{b},\hat{q}):=\frac{1}{n}\sum_{i=1}^{n}\sum_{a\in\mathcal{A}}\pi_{e}(a|x_i)\hat{q}(x_i,a)\)  where \(\hat{q}(x,a)\approx q(x,a):=\mathbb{E}[r|x,a]\),

\(\hat{V}_{\rm{DR}}(\pi_{e};\mathcal{D}_{b},\hat{q}):=\frac{1}{n}\sum_{i=1}^{n}w(x_i,a_i)(r_i -\hat{q}(x_i,a_i))+\hat{V}_{\rm{DM}}(\pi_{e};\mathcal{D}_{b},\hat{q})\).

Each off-policy estimator has its own characteristics. For example, IPS is unbiased but its variance tends to be large when the importance weights are large, whereas DM is biased but its variance tends to be smaller than that of IPS. Many other advanced off-policy estimators have been proposed; interested readers can find them in the literature.
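The three estimators above map directly onto a few lines of NumPy. The sketch below is a straightforward transcription of the formulas, not the implementation of any particular library; it assumes `data` is the hypothetical logged-data dictionary from the earlier snippet, `pi_e` is an \(n\times|\mathcal{A}|\) array holding \(\pi_e(a|x_i)\), and `q_hat` is an \(n\times|\mathcal{A}|\) array holding the reward-model predictions \(\hat{q}(x_i,a)\).

```python
import numpy as np

def ips(pi_e, data):
    """IPS: average of importance-weighted rewards, w(x, a) = pi_e(a|x) / pi_b(a|x)."""
    idx = np.arange(len(data["action"]))
    w = pi_e[idx, data["action"]] / data["pscore"]
    return np.mean(w * data["reward"])

def dm(pi_e, data, q_hat):
    """DM: expected reward under pi_e predicted by the reward model q_hat."""
    return np.mean(np.sum(pi_e * q_hat, axis=1))

def dr(pi_e, data, q_hat):
    """DR: DM baseline plus an importance-weighted correction of the model's residuals."""
    idx = np.arange(len(data["action"]))
    w = pi_e[idx, data["action"]] / data["pscore"]
    residual = data["reward"] - q_hat[idx, data["action"]]
    return np.mean(w * residual) + dm(pi_e, data, q_hat)
```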

Challenge in Off-Policy Evaluation

As we have mentioned, OPE is a technique that can solve important practical problems and is an area of active research. With so many off-policy estimators available, it might seem at first glance that OPE has become practical and easy to use. However, a major challenge remains: deciding which off-policy estimator to use. The best estimator depends on the configuration, such as the number of actions, the sample size, the behavior policy \(\pi_b\), and the evaluation policy \(\pi_e\), and it is empirically known that no single estimator dominates the others [7]. Therefore, to conduct OPE, we need to select an off-policy estimator that is appropriate for the evaluation policy \(\pi_e\) and the data at hand. This process is called estimator selection.

Proposed Method

Our team has developed a method that enables practitioners to select an appropriate estimator adaptively for a given configuration, which we presented at AAAI2023 [1]. In the following, we provide a brief explanation of our proposed method.

First, we introduce the formulation of estimator selection (Figure 3). The performance of an off-policy estimator is measured by the mean squared error (MSE) between the true performance of the policy and its estimate, and estimator selection aims to select the off-policy estimator that minimizes this MSE. However, the MSE cannot be calculated directly because the true performance of the policy is unknown. Therefore, we need to select an off-policy estimator based on an estimated MSE, \(\hat{\rm{MSE}}\).
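Concretely, for a candidate estimator \(\hat{V}\) the criterion can be written as follows (this is the standard definition, spelled out here for readability, writing \(\mathcal{V}\) for the set of candidate estimators):

\(\mathrm{MSE}(\hat{V};\pi_{e}):=\mathbb{E}\big[(V(\pi_{e})-\hat{V}(\pi_{e};\mathcal{D}_{b}))^{2}\big]\),  and estimator selection chooses \(\hat{V}^{*}:=\mathop{\rm arg\,min}_{\hat{V}\in\mathcal{V}}\hat{\rm{MSE}}(\hat{V};\pi_{e})\).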

Figure 3: The goal of estimator selection is to choose the most accurate estimator among the candidates. The performance of an off-policy estimator is measured by the MSE between the true and estimated policy performance, so estimator selection aims to pick the estimator that minimizes the MSE. Because the true performance is unknown, the MSE cannot be computed directly and an estimated MSE must be used instead.

Figure 4 shows the non-adaptive heuristic, the estimator selection procedure used in existing work [8,9]. It assumes that the logged data \(\mathcal{D}_b\) was collected through an A/B test of two policies \(\pi_A\) and \(\pi_B\). We then regard \(\pi_B\) as a pseudo behavior policy and \(\pi_A\) as a pseudo evaluation policy. Using the on-policy estimate of \(\pi_A\)'s performance in place of the unknown true policy performance, we can estimate the MSE of each candidate estimator and select the off-policy estimator with the lowest estimated MSE.

Figure 4: The existing procedure, the non-adaptive heuristic. Assuming the log data was collected through an A/B test of two policies, one policy is treated as a pseudo behavior policy and the other as a pseudo evaluation policy. On-policy estimation stands in for the unknown true policy performance when estimating the MSE, and the off-policy estimator with the lowest estimated MSE is selected.
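A minimal sketch of this heuristic is shown below, reusing the hypothetical estimator functions above; all names are illustrative. Note that in practice the MSE is typically estimated over repeated subsamples (e.g., by bootstrapping) rather than from a single squared error; the snippet keeps only the core idea.

```python
import numpy as np

def non_adaptive_heuristic(data_A, data_B, pi_A_on_B, q_hat_B, candidates):
    """Rank candidate estimators on a pseudo OPE task built from an A/B test.

    data_B (logged under pi_B) plays the role of the pseudo behavior data, while the
    on-policy mean reward of data_A (logged under pi_A) stands in for the unknown true
    value of the pseudo evaluation policy pi_A. pi_A_on_B holds pi_A(a|x) evaluated on
    the contexts in data_B.
    """
    v_on_policy = np.mean(data_A["reward"])  # on-policy estimate of V(pi_A)
    estimated_mse = {
        name: (estimator(pi_A_on_B, data_B, q_hat_B) - v_on_policy) ** 2
        for name, estimator in candidates.items()
    }
    best = min(estimated_mse, key=estimated_mse.get)
    return best, estimated_mse

# Hypothetical usage, reusing the earlier estimator sketches (IPS ignores q_hat):
# candidates = {"IPS": lambda p, d, q: ips(p, d), "DM": dm, "DR": dr}
# best, mses = non_adaptive_heuristic(data_A, data_B, pi_A_on_B, q_hat_B, candidates)
```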

However, the non-adaptive heuristic has a critical pitfall. As mentioned above, the optimal off-policy estimator depends on the evaluation policy \(\pi_e\), yet the non-adaptive heuristic is not adaptive to \(\pi_e\): it selects the same off-policy estimator for every \(\pi_e\). As a result, it may select an off-policy estimator that is inappropriate for the evaluation policy at hand. For example, if \(\pi_e\) and \(\pi_A\) differ greatly, it is easy to see that estimating the MSE accurately becomes difficult, which leads to an inaccurate estimate of the performance of \(\pi_e\).

Our team proposed a method called policy-adaptive estimator selection via importance fitting (PAS-IF) [1] to select an appropriate off-policy estimator for a given evaluation policy \(\pi_e\). The basic idea is shown in Figure 5. Since we want to estimate the MSE in the original OPE task, we split the logged data \(\mathcal{D}_b\) so as to imitate the behavior policy \(\pi_b\) and the evaluation policy \(\pi_e\) as closely as possible. In PAS-IF, we focus on the importance weight \(w(x,a):=\frac{\pi_e(a|x)}{\pi_b(a|x)}\), which characterizes the relationship between the behavior policy \(\pi_b\) and the evaluation policy \(\pi_e\). By optimizing the way the data are subsampled so that the induced importance weight imitates that of the original OPE task, we can adaptively construct data as if it had been collected by a pseudo behavior policy and a pseudo evaluation policy.

The subsequent procedure is the same as in the non-adaptive heuristic: we estimate the MSE using on-policy estimation in place of the unknown true policy performance and select the off-policy estimator with the lowest estimated MSE. This adaptive subsampling and MSE estimation allow us to select the most appropriate off-policy estimator from a number of candidates for the given \(\pi_e\). If you are interested in the specific algorithm and experimental results, please refer to the paper [1]. As future work, we would like to apply our proposed method to OPE in reinforcement learning, which is a more complex problem.

Figure 5: The idea behind our proposed method, policy-adaptive estimator selection via importance fitting (PAS-IF). By subsampling the log data so that it mimics the behavior and evaluation policies, the MSE of each off-policy estimator can be estimated adaptively for the given evaluation policy.
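As a rough illustration of the importance-fitting idea, and not the algorithm from the paper, the sketch below splits the logged data using a closed-form subsampling probability whose odds are proportional to the target importance weight. Here `pi_e_probs` is again an \(n\times|\mathcal{A}|\) array of \(\pi_e(a|x_i)\), and `k` is a hypothetical knob controlling the split size.

```python
import numpy as np

def importance_fitting_split(data, pi_e_probs, k=1.0, seed=0):
    """Crude illustration of the subsampling idea behind PAS-IF (not the paper's algorithm).

    Each logged sample is sent to a pseudo evaluation set with probability rho_i chosen so
    that the odds rho_i / (1 - rho_i) are proportional to the target importance weight
    w_i = pi_e(a_i|x_i) / pi_b(a_i|x_i); the remaining samples form the pseudo behavior set.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(len(data["action"]))
    w = pi_e_probs[idx, data["action"]] / data["pscore"]  # target importance weights
    rho = k * w / (1.0 + k * w)                           # subsampling probabilities
    to_eval = rng.binomial(1, rho).astype(bool)

    subset = lambda mask: {key: value[mask] for key, value in data.items()}
    return subset(~to_eval), subset(to_eval)  # (pseudo behavior data, pseudo evaluation data)
```

After this split, the rest proceeds exactly as in the non-adaptive heuristic: the on-policy mean reward of the pseudo evaluation data replaces the unknown true value, and the candidate with the lowest estimated MSE is selected. The actual PAS-IF learns a parameterized subsampling rule by minimizing the gap between the induced and target importance weights, rather than using this closed form; see the paper [1] for the precise objective.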

About Recruitment

At Sony, we will continue to conduct both academic and practical research on the personalization of decision making and the improvement of actions, including the OPE described in this article. If you are interested in this area, please visit this page. Additionally, by selecting the Technology Category "Machine Learning" or "Data Analysis," you can find related job openings.

References

[1] Udagawa, T.; Kiyohara, H.; Narita, Y.; Saito, Y.; and Tateno, K. 2023. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8), 10025-10033. https://doi.org/10.1609/aaai.v37i8.26195

[2] https://wpengine.com/gen-z-eu/

[3] Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, 759–766. https://dl.acm.org/doi/10.5555/645529.658134

[4] Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, volume 23, 2217–2225. https://arxiv.org/abs/1003.0120

[5] Beygelzimer, A.; and Langford, J. 2009. The Offset Tree for Learning with Partial Labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 129–138. https://dl.acm.org/doi/10.1145/1557019.1557040

[6] Dudík, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29(4): 485–511. https://arxiv.org/abs/1503.02834

[7] Saito, Y.; and Joachims, T. 2022. Counterfactual Evaluation and Learning for Interactive Systems. KDD 2022 Tutorial. https://counterfactual-ml.github.io/kdd2022-tutorial/

[8] Saito, Y.; Aihara, S.; Matsutani, M.; and Narita, Y. 2021a. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://arxiv.org/abs/2008.07146

[9] Saito, Y.; Udagawa, T.; Kiyohara, H.; Mogi, K.; Narita, Y.; and Tateno, K. 2021b. Evaluating the Robustness of Off-Policy Evaluation. In Proceedings of the 15th ACM Conference on Recommender Systems, 114–123. https://dl.acm.org/doi/10.1145/3460231.3474245
