[Maria-developers] some bugs in dingqing parallel replication

19 Jul 2013

Hi guys,

I have been working on the branch https://code.launchpad.net/~knielsen/maria/dingqi-parallel-replication for some days and found the bugs listed below. I hope this is helpful to you.

1, When a table filter is enabled on the slave, the server can crash.

How to reproduce:
on the slave
  set replicate-wild-ignore-table = test.t5 in the config file
on the master, run these statements
  CREATE TABLE test.t3 (a INT AUTO_INCREMENT PRIMARY KEY, b DECIMAL(20,20), c INT);
  SET INSERT_ID=1;
  SET @c=2;
  SET @@rand_seed1=10000000, @@rand_seed2=1000000;
  INSERT INTO t3 VALUES (NULL, RAND(), @c);

Code that causes this bug:
  In execute_single_transaction()
    case RAND_EVENT:
        need_remove_from_trans= true;
        if(!rli->is_deferred_event(ev))
          delete ev;
        break;
Reason:
  The Rand event object is deleted in execute_single_transaction(),
  but its pointer is still used later in slave_execute_deferred_events().
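The hazard can be illustrated with a toy model, which is not the server code. In this minimal sketch, a deferred event is moved into a list that owns it until it has been executed, so the worker cannot free it early. All names (ToyEvent, ToyDeferredEvents, handle_event) are hypothetical, and whether the real fix should transfer ownership this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a log event; not the server's Log_event class.
struct ToyEvent {
  int type;   // e.g. 1 = a RAND-like event
  int value;  // payload used when the event is finally executed
};

// Hypothetical deferred-event list that OWNS the events handed to it.
class ToyDeferredEvents {
 public:
  void add(std::unique_ptr<ToyEvent> ev) { events_.push_back(std::move(ev)); }

  // Execute every deferred event (here: sum the payloads), then free them.
  int execute_all() {
    int sum = 0;
    for (const auto &ev : events_) sum += ev->value;
    events_.clear();
    return sum;
  }

  std::size_t size() const { return events_.size(); }

 private:
  std::vector<std::unique_ptr<ToyEvent>> events_;
};

// Worker-side handling: a deferred event is moved into the list,
// any other event is freed when this function returns.
// Returns true if the event was deferred.
bool handle_event(std::unique_ptr<ToyEvent> ev, bool deferred,
                  ToyDeferredEvents &list) {
  assert(ev != nullptr);
  if (deferred) {
    list.add(std::move(ev));  // the list now controls the lifetime
    return true;
  }
  return false;  // ev is destroyed here, nobody else holds it
}
```

With this shape, the code that later executes the deferred events is the only place where they are freed.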

2, The SQL thread can read and apply some log events more than once.

How to reproduce:
  It is a little hard to reproduce. If you set max_relay_log_size=100M and keep the SQL thread close behind the IO thread, this bug may appear.

Code that causes this bug:
  In reopen_relay_log()
    rli->event_relay_log_pos= max(rli->event_relay_log_pos, BIN_LOG_HEADER_SIZE);
    my_b_seek(cur_log,rli->event_relay_log_pos);

Reason:
  When the SQL thread is reading a hot log that the IO thread has just closed, the SQL thread needs to reopen this log and set the read offset to rli->event_relay_log_pos. But rli->event_relay_log_pos can be updated by other threads, because many threads apply log events, so rli->event_relay_log_pos can be less than rli->future_event_relay_log_pos. Seeking to it makes the SQL thread read again events that it has already read.
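One possible direction, shown as a toy sketch only: choose the reopen offset from the position the reader itself has reached instead of the position that the worker threads update. The struct below only borrows the two field names quoted above; it is not Relay_log_info. That future_event_relay_log_pos marks the end of the last event already read is an assumption here, and kToyHeaderSize is a stand-in for BIN_LOG_HEADER_SIZE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy positions; field names echo the ones in the report.
struct ToyRelayPos {
  std::uint64_t event_relay_log_pos;         // updated as workers apply events, may lag
  std::uint64_t future_event_relay_log_pos;  // assumed: end of the last event read
};

// Stand-in for BIN_LOG_HEADER_SIZE.
const std::uint64_t kToyHeaderSize = 4;

// Offset to seek to after reopening the log, so that events that were
// already read and dispatched are not read a second time.
std::uint64_t reopen_offset(const ToyRelayPos &p) {
  assert(p.event_relay_log_pos <= p.future_event_relay_log_pos);
  return std::max(p.future_event_relay_log_pos, kToyHeaderSize);
}
```

For example, with event_relay_log_pos=100 and future_event_relay_log_pos=250, seeking to 100 would re-read the events between 100 and 250, while this sketch seeks to 250.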

3, When the slave inserts a duplicate record into a table with a primary key, the SQL thread does not report the error in the output of "show slave status" and replication does not stop.

How to reproduce:
  Just change master_log_pos so that the slave reads duplicate records from the master.

Code that causes this bug:
  In execute_single_transaction()
   retry_transaction:
     ev= trans->event_list_head;
    ... ...
  if (ret && rli->trans_retries < slave_trans_retries)
  { ...
    goto retry_transaction;
   }

Reason:
   As I said in another email, Rows_log_event::do_apply_event() is executed twice, but it returns a different result the second time because m_curr_row==m_rows_end.
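The effect can be reproduced with a toy model, which is not the real Rows_log_event. In this sketch the event keeps a row cursor (curr, playing the role of m_curr_row); if the cursor is not rewound before a retry, the second apply finds no rows left and reports success, which hides the original error. The names ToyRowsEvent and apply_with_retries are hypothetical, and modelling the duplicate-key failure as a negative row value is only an illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy rows event with a cursor that survives across apply() calls.
struct ToyRowsEvent {
  std::vector<int> rows;
  std::size_t curr = 0;  // plays the role of m_curr_row

  // Apply the remaining rows. A negative row simulates a duplicate-key
  // error. Returns 0 on success, 1 on error.
  int apply() {
    while (curr < rows.size()) {
      int row = rows[curr];
      ++curr;
      if (row < 0) return 1;
    }
    return 0;  // also reached when curr is already at the end
  }

  void rewind() { curr = 0; }
};

// Retry a failed apply up to max_retries times. Without the rewind the
// retry starts at the end of the rows and the error is swallowed.
int apply_with_retries(ToyRowsEvent &ev, int max_retries,
                       bool rewind_before_retry) {
  assert(max_retries >= 0);
  int ret = ev.apply();
  for (int i = 0; ret != 0 && i < max_retries; ++i) {
    if (rewind_before_retry) ev.rewind();
    ret = ev.apply();
  }
  return ret;
}
```

With rows {1, -1} and one retry, the version without rewind returns 0 (the error is lost), while the version with rewind still returns 1.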

4, Operations such as "show slave status" and "stop slave" can be blocked for a long time.

How to reproduce:
  Just run "show slave status" again and again.

Code that causes this bug:

  In queue_event()
    case FORMAT_DESCRIPTION_EVENT:
         ...
        wait_for_all_dml_done(&mi->rli, true);
    and in process_io_rotate()
        wait_for_all_dml_done(&mi->rli, true);

Reason:
   The IO thread can wait in wait_for_all_dml_done() while holding rpl_mi->data_lock, so operations like "show slave status" are blocked waiting for rpl_mi->data_lock.
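A common way to avoid this kind of stall is to drop the lock for the duration of the long wait and take it again afterwards. The toy sketch below shows only that pattern. ToyMasterInfo, update_pos_after_wait and the wait callback are hypothetical names, and whether data_lock can safely be released at these points in the real server is an assumption that would need checking.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Toy stand-in for the master info object and its data_lock.
struct ToyMasterInfo {
  std::mutex data_lock;
  int log_pos = 0;
};

// Wait for the workers WITHOUT holding data_lock, then update the
// position under the lock. wait_for_workers stands in for a call
// like wait_for_all_dml_done().
void update_pos_after_wait(ToyMasterInfo &mi,
                           const std::function<void()> &wait_for_workers,
                           int new_pos) {
  assert(new_pos >= 0);
  std::unique_lock<std::mutex> lk(mi.data_lock);
  // Work that really needs data_lock before the wait would go here.
  lk.unlock();         // readers of the status can take the lock now
  wait_for_workers();  // potentially long wait, lock not held
  lk.lock();           // re-acquire before touching shared state
  mi.log_pos = new_pos;
}
```

Any state read before the unlock may have changed by the time the lock is taken again, so it has to be re-checked after lk.lock().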

5, "START SLAVE UNTIL" makes replication stop at the wrong place.

How to reproduce:
  Suppose the log events in the relay log look like this:
   BEGIN;      ------->pos1
   LOG_EVENT1;
   LOG_EVENT2;
   COMMIT;     ------>pos2
   BEGIN;      ------>pos3
   LOG_EVENT3;    --->stop_pos 
   LOG_EVENT4;
   COMMIT;     ------>pos4

If we run START SLAVE UNTIL relay_log_pos=stop_pos; replication should stop at pos4, but it stops at pos2.
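The expected behaviour can be written down as a small function over a toy event list: once the until-position has been reached, keep going to the end of the enclosing transaction and stop there. ToyPositionedEvent and until_stop_pos are hypothetical names, and treating each position as the end offset of its event is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy relay log event.
struct ToyPositionedEvent {
  std::uint64_t end_pos;  // assumed: offset just after this event
  bool ends_transaction;  // true for a COMMIT-like event
};

// Returns the end position of the event after which the slave should
// stop, or 0 if until_pos is never reached in this list.
std::uint64_t until_stop_pos(const std::vector<ToyPositionedEvent> &events,
                             std::uint64_t until_pos) {
  bool reached = false;
  std::uint64_t prev_end = 0;
  for (const ToyPositionedEvent &ev : events) {
    assert(ev.end_pos > prev_end);  // positions must increase
    prev_end = ev.end_pos;
    if (ev.end_pos >= until_pos) reached = true;
    // Stop only on a transaction boundary at or after until_pos.
    if (reached && ev.ends_transaction) return ev.end_pos;
  }
  return 0;
}
```

For the layout above, with the two COMMIT events ending at 40 and 80 and LOG_EVENT3 ending at 60, an until-position of 60 gives 80 (pos4), not 40 (pos2).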

6, log_event->thd is wrong.
   Suppose a log_event was read in thread_1, so log_event->thd==thread_1, but this log_event may be dispatched to another thread (say thread_2). The log_event is then applied in thread_2 while log_event->thd is still thread_1. This problem can make applying the log event fail in MySQL, but in MariaDB it seems OK.
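A toy sketch of one way to keep the two consistent: the worker rebinds the event's thread pointer to itself just before applying it. ToyThd, ToyDispatchedEvent and apply_on_worker are hypothetical names that only mirror the thd field mentioned above, and that rebinding at apply time is the right place for the real fix is an assumption.

```cpp
#include <cassert>

// Toy thread context; not the server's THD.
struct ToyThd {
  int id;
};

// Toy event that remembers which thread context it belongs to.
struct ToyDispatchedEvent {
  ToyThd *thd;  // set by the reader thread when the event is created
  int payload;
};

// Apply ev on the given worker. The event is rebound to the worker
// first, so everything done during apply sees the applying thread.
// Returns the id of the thread context the event used.
int apply_on_worker(ToyDispatchedEvent &ev, ToyThd &worker) {
  ev.thd = &worker;  // rebind: reader thread -> applying thread
  assert(ev.thd == &worker);
  return ev.thd->id;
}
```

After the call, an event that was read by thread_1 and applied by thread_2 reports thread_2 as its context.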

2013-07-19

nanyi607rao