These notes summarize the article "The Evolution Trend of Recommendation System Technology: From Recall to Ranking to Re-ranking," which covers the clear technical trends in recommendation systems over the last two years, mainly from the following aspects:
Overall framework of recommendation system
Development trend of recall technology
Development trend of ranking model technology
Development trend of re-ranking technology
Macro architecture of a recommendation system:
The pipeline is typically subdivided into four stages: recall, coarse ranking, fine ranking, and re-ranking. For the recall stage:
1. Traditional: multiple recall channels (each channel is roughly equivalent to ranking by a single feature)
2. Future: model-based recall (introducing multiple features, expanding single-feature ranking into a multi-feature ranking model)
(1) model recall
Given the user embedding and the item embeddings, use an efficient embedding-retrieval tool such as Faiss to quickly find the items that match the user's interests. This is equivalent to using a multi-feature fusion model as the recall model.
In principle, any supervised model can serve as this recall model, such as FM/FFM/DNN. The often-mentioned "two-tower" model refers to a structure that embeds the user side and the item side separately, not to any specific model.
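A minimal sketch of what model-based recall looks like at serving time: user and item embeddings live in one space, and recall is nearest-neighbor search by inner product. All embeddings and item ids below are made-up toy data, and a brute-force scan stands in for a real index such as Faiss.

```python
# Two-tower style recall: score = inner product of user and item embeddings.
# A brute-force scan stands in for an ANN library like Faiss.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recall_top_k(user_emb, item_embs, k):
    """Return ids of the k items whose embedding has the largest
    inner product with the user embedding."""
    scored = sorted(item_embs.items(),
                    key=lambda kv: dot(user_emb, kv[1]), reverse=True)
    return [item_id for item_id, _ in scored[:k]]

# Toy data: three items in a 2-dim embedding space.
item_embs = {
    "item_a": [0.9, 0.1],
    "item_b": [0.1, 0.9],
    "item_c": [0.7, 0.7],
}
user_emb = [1.0, 0.0]  # this user leans toward the first dimension
print(recall_top_k(user_emb, item_embs, 2))
```

In production the sorted scan would be replaced by an approximate nearest-neighbor index; the input/output contract stays the same.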
It is worth noting that if model-based recall is used, in theory it should adopt the same optimization objectives as the ranking model; in particular, if the ranking stage uses multi-objective optimization, the recall model should adopt the same multiple objectives. Likewise, if the pipeline includes a coarse-ranking module, it should share the fine-ranking module's multi-objective setup, keeping the objectives consistent across stages. Recall and coarse ranking sit upstream of fine ranking; if their objectives are inconsistent, the upstream stages are likely to filter out items that the fine-ranking objectives would score highly, hurting the overall result.
(2) Recall based on the user behavior sequence
The core question is how to define the item-aggregation function Fun. Note that the items in a user behavior sequence are ordered chronologically. In theory, any model that can capture time-series structure or local feature correlations applies here, such as CNN, RNN, GRU (an RNN variant), and Transformer, all of which are suited to integrating user behavior sequence information.
In the recall stage, the embedding derived from the user behavior sequence can come from a supervised model, such as next-item prediction; or from an unsupervised method: as long as items can be embedded, the user behavior sequence can be aggregated without supervision, for example by Sum Pooling.
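Sum Pooling, the unsupervised option mentioned above, can be sketched in a few lines. The 3-dim item embeddings below are toy values:

```python
# Unsupervised aggregation of a user's behavior sequence via sum pooling:
# the user embedding is the element-wise sum of the embeddings of the
# items the user interacted with.

def sum_pooling(seq_embs):
    dim = len(seq_embs[0])
    return [sum(e[i] for e in seq_embs) for i in range(dim)]

behavior_seq = [
    [1.0, 0.0, 0.5],   # item clicked first
    [0.5, 1.5, 0.0],   # item clicked next
    [0.0, 0.5, 2.0],   # most recent item
]
user_emb = sum_pooling(behavior_seq)
print(user_emb)   # [1.5, 2.0, 2.5]
```

Mean pooling (dividing by the sequence length) is an equally common variant.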
(3) User multi-interest splitting (embedding a user's multiple interests from the user's behavior item sequence)
(4) Knowledge graph fusion recall
Starting from the user's interest entities, after the knowledge-graph entities are embedded (or working directly on the nodes of the knowledge graph), related entities can be brought in through knowledge-graph expansion or by embedding similarity.
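The edge-hopping flavor of this expansion can be sketched on a tiny toy knowledge graph (entities and edges below are invented for illustration; the embedding-similarity route would instead rank entities by distance to the seed embedding):

```python
# Expand a user's interest entities over a toy knowledge graph by
# hopping along edges (the "knowledge expansion" route).

kg = {
    "iPhone": ["Apple Inc"],
    "Apple Inc": ["iPhone", "MacBook"],
    "MacBook": ["Apple Inc"],
}

def expand_by_edges(seeds, hops=1):
    """Return the seed entities plus everything reachable in `hops` steps."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {n for e in frontier for n in kg.get(e, [])} - seen
        seen |= frontier
    return seen

print(sorted(expand_by_edges({"iPhone"})))          # 1 hop
print(sorted(expand_by_edges({"iPhone"}, hops=2)))  # 2 hops
```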
(5) Graph neural network model recall
The ultimate goal of a graph neural network is to obtain, by some technical means, an embedding for each node in the graph; the most commonly used aggregator is CNN-based. For a graph node there are two kinds of input information: the node's own attribute information, as in the Weibo example cited above, and the graph-structure information, i.e., the other nodes directly connected to it by edges. An aggregator such as a CNN encodes and combines both kinds of information into the node's embedding. By repeatedly applying the aggregator to recompute and update node embeddings, we eventually obtain reliable node embeddings. This iterative process is how information from distant nodes gradually propagates along the graph structure, so the graph structure can transmit and supplement knowledge.
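The iterative update described above can be illustrated on a 3-node toy graph, with a plain mean standing in for the learned aggregator (the mixing weight `alpha` and all embeddings are invented for illustration):

```python
# One round of neighbor aggregation: each node's new embedding mixes
# its own embedding with the mean of its neighbors' embeddings.
# Repeating this lets information travel further through the graph.

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

def aggregate(graph, emb, alpha=0.5):
    new_emb = {}
    for node, neighbors in graph.items():
        mean = [sum(emb[n][i] for n in neighbors) / len(neighbors)
                for i in range(len(emb[node]))]
        new_emb[node] = [alpha * s + (1 - alpha) * m
                         for s, m in zip(emb[node], mean)]
    return new_emb

emb = aggregate(graph, emb)
print(emb["a"])   # half of a's own embedding, half of b's
```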
Going one step further: graph nodes can carry attribute information, such as item content, which obviously helps with the item cold-start problem. Because knowledge can travel long distances through the graph, it can be propagated to supplement scenarios with little user behavior, so the approach also suits recommendation scenarios with sparse data. On the other hand, the edges of the graph are often built from user behavior, which statistically is essentially collaborative information: when we say "item A co-occurs with item B," we essentially mean that many users acted on item A and then on item B. Graphs therefore have a real advantage: they make it convenient and natural to integrate heterogeneous information, including collaborative signals, user behavior, and content attributes, into one unified framework expressed as embeddings. Their other unique advantage is information propagation through the graph, which should be particularly useful for cold-start and sparse-data scenarios.
Early applications of graph neural networks to recommendation required global information, so computation speed was a problem, and the graphs used were often too small to be of practical value. GraphSAGE reduces the computation and speeds it up by means such as sampling neighboring nodes; many later efficiency improvements derive from this work. Building on GraphSAGE (by the same group), PinSage further adopts large-scale distributed computation, extending the practicality of graph computation: it can handle Pinterest's giant graph with 3 billion nodes and 18 billion edges, with good production results. Both works are worth studying.
Generally speaking, graph model recall is a promising direction worth exploring.
The model's optimization objective reflects what we want the recommendation system to do well, and is usually tied to business goals. This section discusses it from a technical angle: multi-objective optimization and listwise optimization are currently the most common evolution directions. Listwise objectives can be used in both the ranking and re-ranking stages, so they are covered in the re-ranking section; here we focus on multi-objective optimization.
The model's expressive power determines whether it can make full use of effective features and feature combinations. Explicit feature combination, new feature extractors, the application of reinforcement learning, and automatic model-structure search with AutoML are the clear technical evolution directions here.
From the perspective of features and information, expanding and fusing information is the main technical evolution direction; separating users' long-term and short-term interests, exploiting user behavior sequence data, graph neural networks, and multimodal fusion are the main trends in this regard.
1.1 Model Optimization Objective - Multi-Objective Optimization
Strictly speaking, multi-objective optimization for recommendation (optimizing click, interaction, and dwell-time objectives simultaneously) is not just a trend but the current state of R&D at many companies. Different optimization objectives may interfere with each other; multi-objective optimization aims to balance these interactions, and when it works well it can greatly boost business metrics. In short, it is a technical direction well worth the attention of recommendation-system R&D staff.
Technically, multi-objective optimization has two key problems. The first is the model structure for multiple optimization objectives; the second is how to weight the importance of the different objectives (how to find good hyperparameters).
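The second problem can be made concrete with the simplest possible combination scheme: a weighted sum of per-objective losses, where the weights are the hyperparameters expressing importance. The loss values and weights below are toy numbers, and real systems often use learned structures (e.g. MMoE, referenced below) rather than a fixed weighted sum.

```python
# Simplest multi-objective combination: weighted sum of per-goal losses.
# The weights are hyperparameters that encode objective importance.

def combined_loss(losses, weights):
    assert set(losses) == set(weights)
    return sum(weights[k] * losses[k] for k in losses)

losses = {"click": 0.40, "interact": 0.20, "duration": 0.10}   # toy values
weights = {"click": 1.0, "interact": 0.5, "duration": 2.0}     # toy values
print(combined_loss(losses, weights))
```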
2.1 Model Expression Ability - Explicit Feature Combination
Looking back at the evolution of industrial CTR models, the automation of feature engineering and feature combination has consistently been the most important driver of practical recommendation-system technology, without exception. The earliest LR models relied on manual feature engineering and manual feature crosses: simple and effective, but time-consuming and laborious. This evolved into automated high-order feature combination with LR+GBDT and automated second-order feature combination with the FM model. Then DNN models were introduced: a plain DNN essentially embeds features FM-style and adds several MLP hidden layers for implicit, nonlinear automatic feature combination.
2.2 Model Expression Ability - Evolution of Feature Extractors
From the perspective of feature extractors, the mainstream DNN ranking models still mostly rely on the MLP structure, the CNN from the image field, and the RNN and Transformer from the NLP field.
The MLP part is usually two or three hidden layers. There are also theoretical studies showing that the MLP structure is inefficient at capturing feature combinations.
CNN is a very effective structure for capturing local feature correlations, but it is a poor fit for recommendation models with pure feature input: features in the recommendation domain have no inherent order, and CNN is weak at capturing long-distance feature relationships. RNN, for its part, cannot be parallelized, so it is slow.
Transformer, the latest and most effective feature extractor in NLP, is actually well suited to recommendation by its very working mechanism. Why? The core is the multi-head self-attention (MHA) mechanism. In NLP, the MHA structure scores the relevance of any two words in the input sentence; carried over to recommendation, it combines any pair of features. As noted above, feature combination is a crucial step for recommendation, so from this angle Transformer is especially suitable for modeling feature combinations: one Transformer block captures second-order feature combinations, and stacking more blocks captures higher-order combinations. In practice, however, Transformer-based recommendation has not shown a clear advantage; results are basically slightly better than or comparable to a typical MLP structure. This suggests we may need to adapt Transformer to the characteristics of the recommendation domain rather than copying the NLP structure directly.
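The "combine any two features" behavior of MHA can be seen in a stripped-down single-head self-attention over feature embeddings. This sketch omits the learned Q/K/V projections (identity projections are assumed) and uses toy 2-dim feature embeddings, so it only shows the attention-mixing step, not a full Transformer block.

```python
import math

# Scaled dot-product self-attention over feature embeddings:
# every feature attends to every other feature, producing a
# context-mixed representation per feature.

def self_attention(feats):
    out = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in feats]
        m = max(scores)                       # numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]       # softmax over features
        out.append([sum(w * v[i] for w, v in zip(weights, feats))
                    for i in range(len(q))])
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy feature embeddings
mixed = self_attention(feats)
print(len(mixed), len(mixed[0]))              # output keeps the input shape
```

Each output row is a convex combination of all feature embeddings, which is exactly the pairwise feature interaction the paragraph describes.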
A screenshot of the teacher's slide about Transformer would be enough to explain it, but I still don't understand it very well, haha ~
2.3 Application of AutoML in Recommendation
AutoML first appeared around 2017 and has flourished in the last three years, with very important research progress in the image and natural language processing fields. In those fields, AutoML can now find model structures better than human-designed ones.
2.4 Application of reinforcement learning in recommendation
Reinforcement learning actually matches the recommendation scenario well. Broadly, reinforcement learning has several key elements: state, action, and reward. In the recommendation scenario, we can define the state St as the set of items in the user's behavior history; the action space of the recommendation system is the list of results recommended to the user given the current state St. Note that the action space in recommendation is huge, which restricts the application of many reinforcement learning methods that cannot model such a large action space. The reward is the value of the user's interactions with the recommended list; for example, clicking an item could be rewarded 1, buying an item 5, and so on. With these elements defined for the scenario, typical reinforcement learning can be used to model recommendation.
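The mapping above (state = behavior history, action = recommended list, reward = click 1 / buy 5) can be written down as a toy environment step. Item ids and the interaction log below are invented, and `step` is only a stand-in showing the state/action/reward contract, not any particular RL algorithm.

```python
# Toy mapping of RL elements onto recommendation:
# state  = list of items in the user's behavior history
# action = the recommended list shown to the user
# reward = sum of per-interaction values (click=1, buy=5, skip=0)

REWARD = {"click": 1, "buy": 5, "skip": 0}

def step(state, recommended, interactions):
    """Apply the user's interactions with a recommended list; return
    the new state and the collected reward."""
    assert set(interactions) <= set(recommended)
    reward = sum(REWARD[act] for act in interactions.values())
    new_state = state + [item for item, act in interactions.items()
                         if act != "skip"]
    return new_state, reward

state = ["item_1"]  # behavior history so far
interactions = {"item_2": "click", "item_3": "buy", "item_4": "skip"}
state, r = step(state, ["item_2", "item_3", "item_4"], interactions)
print(state, r)
```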
3.1 Multi-Modal Information Fusion
Technically, multimodal fusion essentially maps information of different modality types into a unified semantic space via embedding, so that content from different modalities expressing the same semantics becomes directly comparable. For example, the word "apple" in natural language and a photo of an apple should, after being encoded by some technique such as embedding, have high similarity, which means knowledge from different modalities has been mapped into the same semantic space. This makes it possible, for instance, to search for photos containing apples using text.
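Once both modalities are embedded in one space, cross-modal search reduces to similarity ranking. The embeddings below are toy values; a well-trained encoder is assumed to place the word "apple" near an apple photo.

```python
import math

# Cross-modal retrieval after multimodal fusion: rank image embeddings
# by cosine similarity to a text embedding in the shared space.

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return num / (nu * nv)

text_emb = {"apple": [0.9, 0.1]}                       # toy text embedding
image_emb = {"apple.jpg": [0.8, 0.2], "car.jpg": [0.1, 0.9]}  # toy image embeddings

query = text_emb["apple"]
best = max(image_emb, key=lambda name: cosine(query, image_emb[name]))
print(best)
```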
3.2 Separation of Long-Term and Short-Term Interests
Accurately describing user interests is very important for a recommendation system. There are currently two main ways to do so. One expresses user interests through user features, which is the most common; the other uses the sequence of items the user has interacted with as a proxy for the user's interests.
Listwise re-ranking can be discussed from two angles: one is the optimization objective or loss function; the other is the model structure of the re-ranking module.
For learning to rank in recommendation systems, there are three common optimization objectives: pointwise, pairwise, and listwise. First, be clear that "listwise" does not refer to a specific model but to how the model's optimization objective or loss function is defined; in theory, many different models can be trained with a listwise loss. The simplest definition is pointwise: input the user features and a single item's features, score that item, and ignore the relative order between items entirely. This is simple, direct, and efficient in both training and online inference, but its drawback is that it ignores direct correlations between items, which is actually useful information for ranking. A pairwise loss trains the model directly on the order relation between two items; the optimization goal takes the form "item A should rank higher than item B." Pairwise losses are in fact widely used in the recommendation field: BPR loss is a typical and very effective pairwise loss function, often used especially with implicit feedback. A listwise loss focuses on the order of items in the whole list and optimizes the model over that entire order. In recommendation, listwise losses are still used less often, because training data is hard to construct and both training and online inference are slow, although they attend most directly to the overall quality of the ranked results; many recommendation teams are working on them.
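The BPR loss mentioned above is compact enough to write out for a single training triple. The scores below are toy numbers standing in for any model's outputs; the point is that the loss shrinks as the positive item is scored further above the negative one.

```python
import math

# BPR loss for one (user, positive item, negative item) triple:
# -log sigmoid(score_pos - score_neg).

def bpr_loss(score_pos, score_neg):
    return -math.log(1.0 / (1.0 + math.exp(-(score_pos - score_neg))))

good_margin = bpr_loss(3.0, 1.0)   # positive item scored well above negative
bad_margin = bpr_loss(1.0, 3.0)    # ranking inverted
print(good_margin < bad_margin)    # a correct ranking costs less
```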
From the model-structure angle: the re-ranking module usually sits after the fine-ranking module, so its input is the fine-ranking module's top-scoring output, i.e., an already ordered list. The fine-ranking scores or ranks are very important reference information for the re-ranking module, so the input order matters, and models that can exploit input order are the natural first choice for re-ranking. The most common sequence-aware models are RNN and Transformer, so it is natural that both are used frequently in re-ranking modules. The usual practice is to feed the top-ranked items, in order, into an RNN or Transformer, which can then incorporate each item's context, i.e., the other items in the list, and evaluate from the perspective of the whole list. After feature fusion, the RNN or Transformer outputs a new prediction score at each input position, and the items are re-sorted by the new scores, achieving context-aware re-ranking.
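The input/output contract of this step can be sketched without a neural model. Below, a toy rescorer stands in for the RNN/Transformer: it nudges each item's fine-ranking score by how similar the item is to the rest of the list (a made-up diversity-style context signal), just to show "ordered list in, re-scored list out." The items, scores, embeddings, and the 0.5 penalty weight are all invented.

```python
# Shape of a re-ranking step: take the fine-ranking top results in
# order, re-score each item using the whole list as context, then
# sort by the new scores. A real system would use an RNN/Transformer
# as the context model; here a toy similarity penalty stands in.

def rerank(items):
    """items: list of (item_id, fine_score, embedding), pre-sorted by
    fine_score descending. Returns item ids in the new order."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    rescored = []
    for item_id, score, emb in items:
        context_sim = sum(dot(emb, other_emb)
                          for other_id, _, other_emb in items
                          if other_id != item_id) / (len(items) - 1)
        rescored.append((score - 0.5 * context_sim, item_id))
    return [item_id for _, item_id in sorted(rescored, reverse=True)]

top_k = [
    ("a", 0.9, [1.0, 0.0]),
    ("b", 0.8, [1.0, 0.0]),   # near-duplicate of "a"
    ("c", 0.7, [0.0, 1.0]),   # different content
]
print(rerank(top_k))
```

With these toy numbers the distinct item "c" climbs above the near-duplicates, illustrating how list context changes the final order.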
References:
1. The evolution trend of recommendation system technology: from recall to ranking to re-ranking
/p/100019681
2. Typical work on model recall:
FM model recall: One of the Four Recall Models for Recommendation Systems: The All-Purpose FM Model
DNN two-tower recall: Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
3. Typical work on user behavior sequence recall:
GRU: Recurrent Neural Networks with Top-k Gains for Session-based Recommendations
CNN: Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding
Transformer: Self-Attentive Sequential Recommendation
4. Typical work on knowledge graph fusion recall:
KGAT: Knowledge Graph Attention Network for Recommendation
RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems
5. Typical work on graph neural network model recall:
GraphSAGE: Inductive Representation Learning on Large Graphs
PinSage: Graph Convolutional Neural Networks for Web-Scale Recommender Systems
6. Typical work on multi-objective optimization:
MMoE: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
Pareto optimization: A Pareto-Efficient Algorithm for Multiple Objective Optimization in E-Commerce Recommendation
7. Typical work on explicit feature combination:
Deep & Cross: Deep & Cross Network for Ad Click Predictions
xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
8. Typical work on feature extractors:
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks
DeepFM: An End-to-End Wide & Deep Learning Framework for CTR Prediction
9. Comparison of CNN/RNN/Transformer feature extractors: /p/54743941
10. Typical work on AutoML applied to recommendation:
ENAS structure search: applying AutoML to search for recommendation ranking network structures
Bilinear feature combination: FiBiNET: Combining Feature Importance and Bilinear Feature Interaction for Click-Through Rate Prediction
11. Typical work on reinforcement learning applied to recommendation:
YouTube: Top-K Off-Policy Correction for a REINFORCE Recommender System
YouTube: Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology
12. Typical work on multimodal fusion:
DNN recall: Collaborative Multi-modal Deep Learning for the Personalized Product Retrieval in Facebook Marketplace
Ranking: Image Matters: Visually Modeling User Behaviors Using Advanced Model Server
13. Typical work on long-term/short-term interest separation:
1. Neural News Recommendation with Long- and Short-term User Representations
2. Sequence-Aware Recommendation with Long-Term and Short-Term Attention Memory Networks
14. Typical work on listwise re-ranking:
1. Personalized Re-ranking for Recommendation
2. Learning a Deep Listwise Context Model for Ranking Refinement