Netflix’s new algorithm offers optimal recommendation lists for users on a tight time budget

Netflix has developed a new machine learning algorithm based on reinforcement learning to create an optimal list of recommendations given a limited time budget for the user. Recommendation use cases often ignore the time it takes the user to make a decision; Netflix has added this dimension, alongside the relevance of recommendations, to its recommendation system and to its broader decision-making problems.

The evaluation problem can be measured in terms of time, because the user spends time choosing what to watch: reading synopses, watching trailers and previews, and so on, and different shows require different evaluation times. This time budget can be taken into account in the recommendation system, so the recommendation model should build the list of recommendations considering both the relevance of the items and their evaluation cost to the user. The user’s time budget, like their preferences, may not be directly observable, but the goal of the recommendation algorithm is to build the list of recommendations with the highest chance of user engagement. The recommendation system must therefore estimate the user’s time budget in addition to the user’s latent preferences.

A typical recommendation system takes a bandit-style approach to slate construction and builds a list of K items as follows:

Image source: A bandit-style recommendation system for slate construction

The item scorer evaluates all N available items, possibly using the slate constructed so far as additional context. The scores are then passed to a sampler, which selects an item from those available. This scorer-and-sampler step is the core component of the recommendation system, inserting an item into slot k of the slate.
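As a minimal sketch of this loop (the scorer signature and the softmax sampler are illustrative assumptions, not Netflix’s actual components):

```python
import numpy as np

def build_slate(scorer, items, k, rng=None):
    """Bandit-style slate construction: repeatedly score the remaining
    items (optionally conditioning on the slate built so far) and
    sample one item per slot until the slate holds k items."""
    rng = rng or np.random.default_rng()
    slate, remaining = [], list(items)
    for _ in range(k):
        scores = np.array([scorer(item, slate) for item in remaining])
        probs = np.exp(scores - scores.max())  # softmax over scores
        probs /= probs.sum()
        idx = rng.choice(len(remaining), p=probs)
        slate.append(remaining.pop(idx))
    return slate
```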


The recommendation system presents the user with a one-dimensional slate of K items (in a simplified setup), and the user has a time budget modeled as a positive real number. Each item is characterized by two quantities: relevance and cost. When the user evaluates an item in the slate, its cost (time) is deducted from the budget, and once the budget is exhausted the user cannot evaluate any further items. Each item has a probability between 0 and 1 of being consumed, so the likelihood of an item being chosen depends not only on its relevance but also on how many items the user is able to examine.
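A minimal simulation of this user model might look like the following; the top-down scan order and the (relevance, cost) item representation are simplifying assumptions, not Netflix’s actual model:

```python
import numpy as np

def simulate_user(slate, budget, rng=None):
    """Simulate one session: the user scans the slate from the top,
    paying each item's evaluation cost until the time budget runs
    out, and plays an examined item with probability equal to its
    relevance. `slate` is a list of (relevance, cost) pairs."""
    rng = rng or np.random.default_rng()
    examined = 0
    for relevance, cost in slate:
        if budget < cost:          # out of time: the rest goes unseen
            break
        budget -= cost
        examined += 1
        if rng.random() < relevance:
            return True, examined  # the user played this item
    return False, examined         # no play within the budget
```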

A recommendation system that seeks to maximize user engagement with the slate must fit as many relevant items as possible within the user’s budget, making a trade-off between relevance and cost.

The problem is related to the 0/1 Knapsack problem in theoretical computer science: the goal is to find the subset of items with the highest total utility such that the total cost of the subset does not exceed the budget. The 0/1 Knapsack problem is NP-complete, and many approximate solutions exist. Netflix proposes to model budget-constrained recommendation as a Markov Decision Process and to use reinforcement learning to find a solution. In a Markov Decision Process, the key concepts are the current state and the action taken by the agent. In this problem the recommendation system is the agent, and the interaction between the user and the recommendation system is modeled as the environment. The recommendation system (agent) builds the slate by repeatedly selecting the item deemed most appropriate for each of the k slots. The environment (the user’s interaction with the recommendation system) is characterized by the remaining time budget and the items already examined in the slate at each step. In the real world, the user’s time budget is unknown, but it can be estimated from the user’s historical data (e.g., how long the user scrolled before abandoning a session in the historical logs). The reinforcement learning algorithm used for this problem estimates the value function using temporal difference (TD) learning.
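For reference, the 0/1 Knapsack problem mentioned above has an exact dynamic-programming solution when costs are integers; the sketch below illustrates the combinatorial structure the recommendation problem relates to, not Netflix’s method:

```python
def knapsack(values, costs, budget):
    """0/1 Knapsack dynamic program: dp[b] is the best total value
    achievable with budget b, each item used at most once. Costs and
    the budget must be non-negative integers."""
    dp = [0] * (budget + 1)
    for value, cost in zip(values, costs):
        # iterate budgets downward so each item is counted only once
        for b in range(budget, cost - 1, -1):
            dp[b] = max(dp[b], dp[b - cost] + value)
    return dp[budget]

# Three items with utility values and time costs, user budget of 8:
print(knapsack(values=[6, 10, 12], costs=[2, 3, 5], budget=8))  # -> 22
```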


Simulation is very useful for studying and better understanding the problem. With simulation, various recommendation algorithms can be trained and compared. A simple metric, the average number of plays generated per slate, called the play rate, is used to evaluate performance. In addition to the play rate, it is important to consider the effective slate size, the number of items the user can actually examine within their budget: one way to improve the play rate is to build more effective slates, so this metric is important for understanding the mechanics of the recommendation algorithms.
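Under a simulation like the one sketched earlier, both metrics can be estimated by rolling out sessions; this sketch reuses the hypothetical build_slate() and simulate_user() helpers above and assumes `users` yields sampled time budgets:

```python
import numpy as np

def evaluate_policy(scorer, users, items, k, n_sessions=10_000, rng=None):
    """Estimate play rate (fraction of sessions ending in a play) and
    effective slate size (average number of items examined within the
    budget) over simulated sessions."""
    rng = rng or np.random.default_rng()
    plays, examined_counts = 0, []
    for budget in users(n_sessions):
        slate = build_slate(scorer, items, k, rng)
        played, examined = simulate_user(slate, budget, rng)
        plays += played
        examined_counts.append(examined)
    return plays / n_sessions, float(np.mean(examined_counts))
```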

Thanks to the flexibility of the simulation and its configurable parameters, the Netflix team learned to build optimal slates in an on-policy fashion using the SARSA algorithm. Comparing the RL model with a contextual bandit, performance is much better for the reinforcement learning approach on both the effective slate size and the play rate. Specifically, the result is a statistically significant increase in the play rate for users with small to medium budgets.
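For context, the core on-policy SARSA update is shown below as a generic tabular sketch (not Netflix’s production code); here the state would encode the remaining budget and the slate built so far, and the action is the next item to place:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    """One SARSA step: move Q(s, a) toward the on-policy TD target
    r + gamma * Q(s', a'), where a' is the action the current policy
    actually took next. Q is a dict keyed by (state, action)."""
    td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
```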

On-policy learning is easy to simulate but difficult to apply in realistic recommendation settings, because the data is generated by a different policy (the behavior policy) and the goal is to learn a new, better policy from that data. In this case, Q-learning is the technique that allows learning the optimal value function in an off-policy setting.
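The corresponding off-policy update (again a generic tabular sketch) differs only in the TD target, which maximizes over the next actions instead of using the one the behavior policy actually took:

```python
def q_learning_update(Q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.99):
    """One Q-learning step: the TD target uses max over the next
    actions, so the optimal value function can be learned from data
    generated by a different behavior policy. Q is a dict keyed by
    (state, action)."""
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    td_target = reward + gamma * best_next
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
```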

Q-learning and on-policy SARSA are compared, and the result is that Q-learning appears to generate much larger effective slate sizes without much difference in play rate. This result is interesting but still unclear, and it needs further study to be fully understood.

