Before it is too late, let's go directly to the contents and conclusions of this article. Friends can read the conclusions first and quickly understand what help bloggers expect this article to bring to their friends:
This book continues the above content and introduces the problem of using flink sql regular join when the exposure stream is associated with the click stream.
This paper introduces how to use flink sql interval join to solve these problems.
Flink sql know why (12): Is it difficult to connect streams? ㈠
Take a look at the actual case in the previous section to see what the output value should look like in the scenario of specific input values.
Scenario: The ordinary exposure log stream (show_log) is associated with the click log stream (click_log) through the log_id, and the associated results of data are distributed.
A wave of input data:
Exposure data:
Click data:
The expected output data are as follows:
The solution of flink sql regular connection in the previous section is as follows:
As mentioned above, if the left table stream (show_log) joins the right table stream (click_log) when the stream data arrives, it will not wait for the direct output of the right table stream (show_log, null), but exit (show_log, null) and send (show _ log, click) when the subsequent right table stream data is copied. This is why there is a retraction flow, which leads to repeated writing of Kafka.
In this regard, we also put forward corresponding solutions. Since the left stream will not wait for the right stream in the left join, can the left stream be forced to wait for the right stream for a while, but not for data irrelevant?
Dangdang Dangdang! ! !
Flink sql interval join in this article appears, and it can wait.
Let's briefly understand the function of interval join through the following sentences and figures (which may have been used by Meng, a friend who is familiar with DataStream), and then introduce the principle in detail.
Interval connection is to use the data of one stream to correlate the data of another stream for a period of time. If it is associated, the associated data will be distributed; If there is no correlation, it will be allocated according to whether it is an external connection (left connection, right connection, full connection) after timeout.
& ltfig caption style = " margin:5px 0px 0px; Filling: 0px Outline: 0px Maximum width:100%; Box size: border-box! Important; Overflow-line break: hyphenation! Important; Text alignment: centered; Color: rgb( 136, 136,136); font-size: 12px; font-family:ping fangsc-Light;" & gt interval join & lt/fig caption & gt;;
Let's take a look at how to write flink sql interval join sql in the above case:
Show _ log.row _ time between click _ log.row _ time-interval'10' minutes and click _ log.row _ time+interval'10' minutes represents the data in the Show _ log table. And row_time in the click_log table in 10 minutes.
The running results are as follows:
The above is the correct result we expect.
The schematic diagram of Flink web ui operator is as follows:
& ltfig caption style = " margin:5px 0px 0px; Filling: 0px Outline: 0px Maximum width:100%; Box size: border-box! Important; Overflow-line break: hyphenation! Important; Text alignment: centered; Color: rgb( 136, 136,136); font-size: 12px; font-family:ping fangsc-Light;" & gtflink web ui & lt/fig caption & gt;
Then you may have a question at this time. I know that the first two data in the result are connected to the output. Then why is show_log join output when it is less than click_log? What is the principle?
The blogger takes you to see the specific source code. First look at the conversion.
& ltfig caption style = " margin:5px 0px 0px; Filling: 0px Outline: 0px Maximum width:100%; Box size: border-box! Important; Overflow-line break: hyphenation! Important; Text alignment: centered; Color: rgb( 136, 136,136); font-size: 12px; font-family:ping fangsc-Light;" & gt conversion & lt/figcaption >
It can be seen that the specific operator of the event time interval join is org. Apache. Flink. Table. Runtime. Operator. Join. KeyedCoprocessOperator with watermark delay.
Its core logic focuses on processElement 1 and processElement2. In processElement 1 and processElement2, org. Apache. Flink. Table. Runtime. Operator. Join. Interval. RowTimeInterval joins are used to handle specific join logic. The important method of RowTimeIntervalJoin is shown in the following figure.
TimeIntervalJoin
Let me explain it to you in detail.
When joining, the left stream and the right stream will wait for each other in the interval. If they wait, the data will be output [+(show_log, click_log)]. If they can't wait, and the time of the other stream has advanced to the point where it is impossible for the current data to join the data of the other stream, then the data will be directly output [+(show_log, null)], [+(null).
For example, show _ log.row _ time-interval'10' minutes and click _ log. row _ time+interval'10' minutes, When the time of click_log is advanced to 2021-111:00: 00, the show_log will reach 20265438+, so this show _ log cannot be compared with the data in click _ log. Because the data in click_log is 2021-1-01:50: 00 to 2021-kloc-0/65438. Show_log directly outputs [+(show_log, null)].
Take the show_log (left table) interval connection click_log (right table) in the above case as an example (whether it is inner interval connection, left interval connection, right interval connection or full interval connection, the following process will be followed):
The above is only the execution flow when the left stream show_log data arrives (that is, ProcessElement 1), and it is also a completely similar execution flow when the right stream click_log arrives (that is, ProcessElement2).
Little friend Meng needs to pay attention to two things when using interval connection:
This paper mainly introduces how flink sql interval can avoid the retract problem of flink regular connection, and explains the operation principle by analyzing its implementation. Bloggers expect you to understand after reading this article: