Let's start with a production case: while switching masters under GTID replication with MASTER_AUTO_POSITION = 1, we hit Error 1236:

2018-05-08 14:45:52 26147 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236

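The error log says that the replica needs a set of GTIDs to catch up, and that the master has already purged exactly those GTIDs, so replication cannot be established. It does not say which GTIDs are missing or how that set was computed.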

To look for clues in the MySQL source: the official documentation maps Error 1236 to ER_MASTER_FATAL_ERROR_READING_BINLOG, which leads to line 1007 of sql/rpl_master.cc (MySQL 5.6.39):

/*
Setting GTID_PURGED (when GTID_EXECUTED set is empty i.e., when
previous_gtids are also empty) will make binlog rotate. That
leaves first binary log with empty previous_gtids and second
binary log's previous_gtids with the value of gtid_purged.
In find_first_log_not_in_gtid_set() while we search for a binary
log whose previous_gtid_set is subset of slave_gtid_executed,
in this particular case, server will always find the first binary
log with empty previous_gtids which is subset of any given
slave_gtid_executed. Thus Master thinks that it found the first
binary log which is actually not correct and unable to catch
this error situation. Hence adding below extra if condition
to check the situation. Slave should know about Master's purged GTIDs.
If Slave's GTID executed + retrieved set does not contain Master's
complete purged GTID list, that means Slave is requesting(expecting)
GTIDs which were purged by Master. We should let Slave know about the
situation. i.e., throw error if slave's GTID executed set is not
a superset of Master's purged GTID set.
The other case, where user deleted binary logs manually
(without using 'PURGE BINARY LOGS' command) but gtid_purged
is not set by the user, the following if condition cannot catch it.
But that is not a problem because in find_first_log_not_in_gtid_set()
while checking for subset previous_gtids binary log, the logic
will not find one and an error ER_MASTER_HAS_PURGED_REQUIRED_GTIDS
is thrown from there.
*/
if (!gtid_state->get_lost_gtids()->is_subset(slave_gtid_executed))
{
  errmsg= ER(ER_MASTER_HAS_PURGED_REQUIRED_GTIDS);
  my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
  global_sid_lock->unlock();
  GOTO_ERR;
}

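The code and its comment show that if the replica's executed + retrieved GTID set does not fully contain the master's purged GTID set, the replica cannot get the GTIDs it needs from the master's binlogs. Put differently: take the difference between the master's and the replica's executed GTID sets; if any part of it is already in the master's gtid_purged, replication cannot be established and Error 1236 is raised.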
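To see why the two phrasings agree, here is a minimal Python sketch. It is not MySQL code: GTIDs are modeled as (uuid, number) pairs, the function names and the placeholder UUID "A" are made up for illustration, and `slave_executed` stands for the replica's executed + retrieved set. The sketch relies on the master's gtid_purged always being a subset of its gtid_executed.

```python
def master_rejects(purged, slave_executed):
    """Mirror of the server-side check: error unless purged is a subset of slave_executed."""
    return not purged <= slave_executed


def missing_overlaps_purged(master_executed, purged, slave_executed):
    """Reformulation: some GTID the replica still lacks has already been purged."""
    return bool((master_executed - slave_executed) & purged)


# GTIDs modeled as (server_uuid, transaction_number); "A" is a placeholder UUID.
master_executed = {("A", n) for n in range(1, 101)}
purged = {("A", n) for n in range(1, 51)}      # subset of master_executed

behind = {("A", n) for n in range(1, 41)}      # stopped at 40: needs 41-50, already purged
caught_up = {("A", n) for n in range(1, 61)}   # stopped at 60: needs only 61-100

for slave_executed in (behind, caught_up):
    # Both formulations give the same verdict for each replica.
    assert master_rejects(purged, slave_executed) == missing_overlaps_purged(
        master_executed, purged, slave_executed
    )
```

Note that the `behind` replica is rejected even though the GTIDs it lacks (41-100) are only partly purged: a single purged GTID among the missing ones is enough.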

So the following steps, run on the new master, tell you ahead of time whether switching to it will trigger Error 1236:

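  1. Take the replica's gtid_executed set after replication has been stopped on it; call it slave.gtid_executed
  2. On the new master, compute gtid_subtract(@@global.gtid_executed, slave.gtid_executed)
  3. Check the result of step 2 against @@global.gtid_purged: if the two overlap, i.e. any missing GTID has already been purged, the error will occur. Because gtid_purged is always contained in gtid_executed, this is the same as gtid_subset(@@global.gtid_purged, slave.gtid_executed) returning 0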

In pseudocode:

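# one interval set per server UUID that the replica is missing
for set in gtid_subtract(master.gtid_executed, slave.gtid_executed):
    # subtracting gtid_purged changes the set only if part of it was purged
    if gtid_subtract(set, master.gtid_purged) != set:
        return ER_MASTER_HAS_PURGED_REQUIRED_GTIDS
return 0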
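The pseudocode can be fleshed out as a small Python sketch that works on the GTID set strings copied from each server. This is an illustration under assumptions, not a MySQL API: `parse_gtid_set`, `gtid_subtract`, `gtid_intersect` and `purged_gtids_needed` are hypothetical helpers, and the parser expands every interval into individual numbers, which is fine for a demonstration but too slow for production-sized sets.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            # "7" is a single transaction; "1-5" is an inclusive range.
            numbers.update(range(int(first), int(last or first) + 1))
    return result


def gtid_subtract(a, b):
    """GTIDs in a that are not in b (hypothetical analogue of GTID_SUBTRACT)."""
    out = {}
    for uuid, nums in a.items():
        rest = nums - b.get(uuid, set())
        if rest:
            out[uuid] = rest
    return out


def gtid_intersect(a, b):
    """GTIDs present in both a and b."""
    out = {}
    for uuid, nums in a.items():
        common = nums & b.get(uuid, set())
        if common:
            out[uuid] = common
    return out


def purged_gtids_needed(master_executed, master_purged, slave_executed):
    """GTIDs the replica lacks that the master has purged; non-empty means expect Error 1236."""
    missing = gtid_subtract(parse_gtid_set(master_executed),
                            parse_gtid_set(slave_executed))
    return gtid_intersect(missing, parse_gtid_set(master_purged))
```

The three arguments are the new master's @@global.gtid_executed and @@global.gtid_purged and the replica's gtid_executed, passed as plain strings. An empty result means the pre-check found no purged GTID that the replica still needs.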