developers
Threads by month
- ----- 2025 -----
- January
- ----- 2024 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2023 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2022 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2021 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2020 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2019 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2018 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2017 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2016 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2015 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2014 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2013 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2012 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2011 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2010 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2009 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
September 2012
- 13 participants
- 17 discussions
Re: [Maria-developers] MDEV-318 IF (NOT) EXIST clauses for ALTER TABLE
by Sergei Golubchik 13 Jan '13
by Sergei Golubchik 13 Jan '13
13 Jan '13
Hi, Holyfoot!
On Jul 08, holyfoot(a)askmonty.org wrote:
> At file:///home/hf/wmar/mdev-318/
>
> ------------------------------------------------------------
> revno: 3552
> revision-id: holyfoot(a)askmonty.org-20120708152354-d7iu7yoqo3cohidz
> parent: sanja(a)montyprogram.com-20120626184334-ptpg39ptq3t79dg5
> committer: Alexey Botchkov <holyfoot(a)askmonty.org>
> branch nick: mdev-318
> timestamp: Sun 2012-07-08 20:23:54 +0500
> message:
> MDEV-318 IF (NOT) EXIST clauses for ALTER TABLE (MWL #252).
> ALTER TABLE syntax modified and related implementations added to allow:
> ALTER TABLE ADD/DROP COLUMN
> ALTER TABLE ADD/DROP INDEX
> ALTER TABLE ADD/DROP FOREIGN KEY
> ALTER TABLE ADD/DROP PARTITION
> ALTER TABLE CHANGE COLUMN
> ALTER TABLE MODIFY COLUMN
Please, find my comments below:
> === modified file 'mysql-test/r/alter_table.result'
> --- mysql-test/r/alter_table.result 2011-11-21 17:13:14 +0000
> +++ mysql-test/r/alter_table.result 2012-07-11 14:24:32 +0000
> @@ -1282,3 +1282,30 @@
> # Clean-up.
> drop table t1;
> End of 5.1 tests
> +CREATE TABLE t1 (
> +id INT(11) NOT NULL,
> +x_param INT(11) DEFAULT NULL,
> +PRIMARY KEY (id),
> +KEY x_param (x_param)
> +);
> +ALTER TABLE t1 ADD COLUMN IF NOT EXISTS id INT,
> +ADD COLUMN IF NOT EXISTS lol INT AFTER id;
> +Warnings:
> +Note 1060 Duplicate column name 'id'
> +ALTER TABLE t1 ADD COLUMN IF NOT EXISTS lol INT AFTER id;
> +Warnings:
> +Note 1060 Duplicate column name 'lol'
> +ALTER TABLE t1 DROP COLUMN IF EXISTS lol;
> +ALTER TABLE t1 DROP COLUMN IF EXISTS lol;
> +Warnings:
> +Note 1091 Can't DROP 'lol'; check that column/key exists
> +ALTER TABLE t1 ADD KEY IF NOT EXISTS x_param(x_param);
> +Warnings:
> +Note 1061 Duplicate key name 'x_param'
> +ALTER TABLE t1 ADD KEY IF NOT EXISTS x_param(x_param);
typo? you test ADD KEY IF NOT EXISTS twice.
> +Warnings:
> +Note 1061 Duplicate key name 'x_param'
> +ALTER TABLE t1 MODIFY IF EXISTS lol INT;
> +Warnings:
> +Note 1054 Unknown column 'lol' in 't1'
> +DROP TABLE t1;
> === modified file 'sql/field.cc'
> --- sql/field.cc 2012-03-13 14:38:43 +0000
> +++ sql/field.cc 2012-07-11 14:28:40 +0000
> @@ -9182,6 +9182,7 @@ void Create_field::init_for_tmp_table(en
First: please, see the last section in
http://kb.askmonty.org/en/how-to-get-more-out-of-bzr-when-working-on-mariad…
and do as it suggests
I had to apply your patch locally and re-run bzr diff
to get the function names in the diff :(
> (maybe_null ? FIELDFLAG_MAYBE_NULL : 0) |
> (is_unsigned ? 0 : FIELDFLAG_DECIMAL));
> vcol_info= 0;
> + create_if_not_exists= FALSE;
> stored_in_db= TRUE;
> }
>
> === modified file 'sql/sql_lex.h'
> --- sql/sql_lex.h 2011-12-11 16:39:33 +0000
> +++ sql/sql_lex.h 2012-07-11 14:28:40 +0000
> @@ -1816,6 +1816,7 @@ typedef struct st_lex : public Query_tab
> uint16 create_view_algorithm;
> uint8 create_view_check;
> bool drop_if_exists, drop_temporary, local_file, one_shot_set;
> + bool check_exists; /* Enabled if the IF [NOT] EXISTS specified for ALTER */
I think, you can remove drop_if_exists and use check_exists instead
> bool autocommit;
> bool verbose, no_write_to_binlog;
>
> === modified file 'sql/sql_yacc.yy'
> --- sql/sql_yacc.yy 2012-06-10 09:53:06 +0000
> +++ sql/sql_yacc.yy 2012-07-11 14:28:40 +0000
> @@ -4656,8 +4659,16 @@ table_option:
> ;
>
> opt_if_not_exists:
> - /* empty */ { $$= 0; }
> - | IF not EXISTS { $$=HA_LEX_CREATE_IF_NOT_EXISTS; }
> + /* empty */
> + {
> + Lex->check_exists= FALSE;
Let's initialize check_exists either before the grammar is parsed
(like in the add_create_index_prepare and in the create rule)
of in the grammar (here, in the opt_if_not_exists).
But don't do it twice.
> + $$= 0;
> + }
> + | IF not EXISTS
> + {
> + Lex->check_exists= TRUE;
> + $$=HA_LEX_CREATE_IF_NOT_EXISTS;
> + }
> ;
>
> opt_create_table_options:
> @@ -5879,6 +5891,18 @@ opt_ident:
> | field_ident { $$=$1.str; }
> ;
>
> +opt_if_not_exists_ident:
> + opt_if_not_exists opt_ident
> + {
> + LEX *lex= Lex;
> + if (lex->check_exists && lex->sql_command != SQLCOM_ALTER_TABLE)
> + {
> + my_parse_error(ER(ER_SYNTAX_ERROR));
> + MYSQL_YYABORT;
> + }
> + $$= $2;
> + };
> +
why did you create opt_if_not_exists_ident rule, instead of adding
opt_if_not_exists before ident in the alter table grammar?
> opt_component:
> /* empty */ { $$= null_lex_str; }
> | '.' ident { $$= $2; }
> @@ -6455,6 +6482,11 @@ opt_column:
> | COLUMN_SYM {}
> ;
>
> +opt_column_if_exists:
> + opt_column if_exists
> + {}
> + ;
> +
same as above - why a special rule?
> opt_ignore:
> /* empty */ { Lex->ignore= 0;}
> | IGNORE_SYM { Lex->ignore= 1;}
> @@ -10096,8 +10128,16 @@ table_alias_ref:
> ;
>
> if_exists:
please, rename this old rule to opt_if_exists
> - /* empty */ { $$= 0; }
> - | IF EXISTS { $$= 1; }
> + /* empty */
> + {
> + Lex->check_exists= FALSE;
> + $$= 0;
> + }
> + | IF EXISTS
> + {
> + Lex->check_exists= TRUE;
> + $$= 1;
> + }
> ;
>
> opt_temporary:
you didn't change CREATE INDEX and DROP INDEX, did you?
and ALTER TABLE DROP PRIMARY KEY isn't covered either
> === modified file 'sql/sql_table.cc'
> --- sql/sql_table.cc 2012-04-05 21:07:18 +0000
> +++ sql/sql_table.cc 2012-07-11 14:28:40 +0000
> @@ -5788,6 +5788,226 @@ is_index_maintenance_unique (TABLE *tabl
>
>
> /*
> + Preparation for table creation
> +
> + SYNOPSIS
> + handle_if_exists_option()
Why did you do that in a separate function? You have basically duplicated
all checks. normal alter table code checks for duplicate columns/key names
and for non-existant columns/keys for DROP.
this code is still in place after your patch. so, basically,
you'll check for duplicates and non-existant names twice.
what's the point of that?
I'd expect you to take check_exists flag into account
in the existing checking code and convert errors into warnings
as appropriate.
> + thd Thread object.
> + table The altered table.
> + alter_info List of columns and indexes to create
> +
> + DESCRIPTION
> + Looks for the IF [NOT] EXISTS options, checks the states and remove items
> + from the list if existing found.
> +
> + RETURN VALUES
> + NONE
> +*/
> +
> +static void
> +handle_if_exists_options(THD *thd, TABLE *table, Alter_info *alter_info)
> +{
> + Field **f_ptr;
Regards,
Sergei
2
9
Re: [Maria-developers] [Commits] Rev 3435: MDEV-532: Async InnoDB commit checkpoint. in http://bazaar.launchpad.net/~maria-captains/maria/10.0
by Kristian Nielsen 14 Dec '12
by Kristian Nielsen 14 Dec '12
14 Dec '12
Hi Serg,
As we discussed under review of MDEV-232, here is a separate patch that makes
InnoDB/XtraDB commit checkpointing be more asynchroneous.
See MDEV-532 for further description of this task.
I hope you will review this, at your convenience.
- Kristian.
knielsen(a)knielsen-hq.org writes:
> At http://bazaar.launchpad.net/~maria-captains/maria/10.0
>
> ------------------------------------------------------------
> revno: 3435
> revision-id: knielsen(a)knielsen-hq.org-20120914124453-zsap6hjclq3vrb6n
> parent: knielsen(a)knielsen-hq.org-20120913123129-kaujy4cw0jc9o08k
> committer: knielsen(a)knielsen-hq.org
> branch nick: work-10.0-mdev225-181-232
> timestamp: Fri 2012-09-14 14:44:53 +0200
> message:
> MDEV-532: Async InnoDB commit checkpoint.
>
> Make the commit checkpoint inside InnoDB be asynchroneous.
> Implement a background thread in binlog to do the writing and flushing of
> binlog checkpoint events to disk.
> === modified file 'mysql-test/suite/binlog/r/binlog_checkpoint.result'
> --- a/mysql-test/suite/binlog/r/binlog_checkpoint.result 2012-09-13 12:31:29 +0000
> +++ b/mysql-test/suite/binlog/r/binlog_checkpoint.result 2012-09-14 12:44:53 +0000
> @@ -70,8 +70,14 @@ show binlog events in 'master-bin.000003
> Log_name Pos Event_type Server_id End_log_pos Info
> master-bin.000003 # Format_desc # # SERVER_VERSION, BINLOG_VERSION
> master-bin.000003 # Binlog_checkpoint # # master-bin.000001
> +SET DEBUG_SYNC= "RESET";
> +SET @old_dbug= @@global.DEBUG_DBUG;
> +SET GLOBAL debug_dbug="+d,binlog_background_checkpoint_processed";
> SET DEBUG_SYNC= "now SIGNAL con2_continue";
> con1 is still pending, no new binlog checkpoint should have been logged.
> +SET DEBUG_SYNC= "now WAIT_FOR binlog_background_checkpoint_processed";
> +SET GLOBAL debug_dbug= @old_dbug;
> +SET DEBUG_SYNC= "RESET";
> show binlog events in 'master-bin.000003' from <binlog_start>;
> Log_name Pos Event_type Server_id End_log_pos Info
> master-bin.000003 # Format_desc # # SERVER_VERSION, BINLOG_VERSION
>
> === modified file 'mysql-test/suite/binlog/r/binlog_xa_recover.result'
> --- a/mysql-test/suite/binlog/r/binlog_xa_recover.result 2012-09-13 12:31:29 +0000
> +++ b/mysql-test/suite/binlog/r/binlog_xa_recover.result 2012-09-14 12:44:53 +0000
> @@ -118,7 +118,11 @@ master-bin.00000<binlog_start> # Table_m
> master-bin.00000<binlog_start> # Write_rows # # table_id: # flags: STMT_END_F
> master-bin.00000<binlog_start> # Xid # # COMMIT /* XID */
> SET DEBUG_SYNC= "now SIGNAL con10_cont";
> +SET @old_dbug= @@global.DEBUG_DBUG;
> +SET GLOBAL debug_dbug="+d,binlog_background_checkpoint_processed";
> SET DEBUG_SYNC= "now SIGNAL con12_cont";
> +SET DEBUG_SYNC= "now WAIT_FOR binlog_background_checkpoint_processed";
> +SET GLOBAL debug_dbug= @old_dbug;
> SET DEBUG_SYNC= "now SIGNAL con11_cont";
> Checking that master-bin.000004 is the last binlog checkpoint
> show binlog events in 'master-bin.00000<binlog_start>' from <binlog_start>;
>
> === modified file 'mysql-test/suite/binlog/t/binlog_checkpoint.test'
> --- a/mysql-test/suite/binlog/t/binlog_checkpoint.test 2012-09-13 12:31:29 +0000
> +++ b/mysql-test/suite/binlog/t/binlog_checkpoint.test 2012-09-14 12:44:53 +0000
> @@ -71,6 +71,12 @@ SET DEBUG_SYNC= "now WAIT_FOR con2_ready
> --let $binlog_file= master-bin.000003
> --source include/show_binlog_events.inc
>
> +# We need to sync the test case with the background processing of the
> +# commit checkpoint, otherwise we get nondeterministic results.
> +SET DEBUG_SYNC= "RESET";
> +SET @old_dbug= @@global.DEBUG_DBUG;
> +SET GLOBAL debug_dbug="+d,binlog_background_checkpoint_processed";
> +
> SET DEBUG_SYNC= "now SIGNAL con2_continue";
>
> connection con2;
> @@ -78,6 +84,12 @@ reap;
>
> connection default;
> --echo con1 is still pending, no new binlog checkpoint should have been logged.
> +# Make sure commit checkpoint is processed before we check that no checkpoint
> +# event has been binlogged.
> +SET DEBUG_SYNC= "now WAIT_FOR binlog_background_checkpoint_processed";
> +SET GLOBAL debug_dbug= @old_dbug;
> +SET DEBUG_SYNC= "RESET";
> +
> --let $binlog_file= master-bin.000003
> --source include/show_binlog_events.inc
>
>
> === modified file 'mysql-test/suite/binlog/t/binlog_xa_recover.test'
> --- a/mysql-test/suite/binlog/t/binlog_xa_recover.test 2012-09-13 12:31:29 +0000
> +++ b/mysql-test/suite/binlog/t/binlog_xa_recover.test 2012-09-14 12:44:53 +0000
> @@ -14,8 +14,24 @@ CREATE TABLE t1 (a INT PRIMARY KEY, b ME
> # Insert some data to force a couple binlog rotations (3), so we get some
> # normal binlog checkpoints before starting the test.
> INSERT INTO t1 VALUES (100, REPEAT("x", 4100));
> +# Wait for the master-bin.000002 binlog checkpoint to appear.
> +--let $wait_for_all= 0
> +--let $show_statement= SHOW BINLOG EVENTS IN "master-bin.000002"
> +--let $field= Info
> +--let $condition= = "master-bin.000002"
> +--source include/wait_show_condition.inc
> INSERT INTO t1 VALUES (101, REPEAT("x", 4100));
> +--let $wait_for_all= 0
> +--let $show_statement= SHOW BINLOG EVENTS IN "master-bin.000003"
> +--let $field= Info
> +--let $condition= = "master-bin.000003"
> +--source include/wait_show_condition.inc
> INSERT INTO t1 VALUES (102, REPEAT("x", 4100));
> +--let $wait_for_all= 0
> +--let $show_statement= SHOW BINLOG EVENTS IN "master-bin.000004"
> +--let $field= Info
> +--let $condition= = "master-bin.000004"
> +--source include/wait_show_condition.inc
>
> # Now start a bunch of transactions that span multiple binlog
> # files. Leave then in the state prepared-but-not-committed in the engine
> @@ -153,10 +169,19 @@ SET DEBUG_SYNC= "now SIGNAL con10_cont";
> connection con10;
> reap;
> connection default;
> +
> +# We need to sync the test case with the background processing of the
> +# commit checkpoint, otherwise we get nondeterministic results.
> +SET @old_dbug= @@global.DEBUG_DBUG;
> +SET GLOBAL debug_dbug="+d,binlog_background_checkpoint_processed";
> +
> SET DEBUG_SYNC= "now SIGNAL con12_cont";
> connection con12;
> reap;
> connection default;
> +SET DEBUG_SYNC= "now WAIT_FOR binlog_background_checkpoint_processed";
> +SET GLOBAL debug_dbug= @old_dbug;
> +
> SET DEBUG_SYNC= "now SIGNAL con11_cont";
> connection con11;
> reap;
> @@ -210,7 +235,20 @@ RESET MASTER;
> # crash recovery fails due to the error insert used for previous test.
> INSERT INTO t1 VALUES (21, REPEAT("x", 4100));
> INSERT INTO t1 VALUES (22, REPEAT("x", 4100));
> +# Wait for the master-bin.000003 binlog checkpoint to appear.
> +--let $wait_for_all= 0
> +--let $show_statement= SHOW BINLOG EVENTS IN "master-bin.000003"
> +--let $field= Info
> +--let $condition= = "master-bin.000003"
> +--source include/wait_show_condition.inc
> INSERT INTO t1 VALUES (23, REPEAT("x", 4100));
> +# Wait for the last (master-bin.000004) binlog checkpoint to appear.
> +--let $wait_for_all= 0
> +--let $show_statement= SHOW BINLOG EVENTS IN "master-bin.000004"
> +--let $field= Info
> +--let $condition= = "master-bin.000004"
> +--source include/wait_show_condition.inc
> +
> --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
> wait-binlog_xa_recover.test
> EOF
>
> === modified file 'mysql-test/suite/perfschema/r/all_instances.result'
> --- a/mysql-test/suite/perfschema/r/all_instances.result 2012-06-22 09:46:28 +0000
> +++ b/mysql-test/suite/perfschema/r/all_instances.result 2012-09-14 12:44:53 +0000
> @@ -76,6 +76,7 @@ wait/synch/mutex/sql/Master_info::run_lo
> wait/synch/mutex/sql/Master_info::sleep_lock
> wait/synch/mutex/sql/MDL_map::mutex
> wait/synch/mutex/sql/MDL_wait::LOCK_wait_status
> +wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_binlog_thread
> wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_index
> wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_xid_list
> wait/synch/mutex/sql/MYSQL_RELAY_LOG::LOCK_index
> @@ -129,6 +130,8 @@ wait/synch/cond/sql/Master_info::sleep_c
> wait/synch/cond/sql/Master_info::start_cond
> wait/synch/cond/sql/Master_info::stop_cond
> wait/synch/cond/sql/MDL_context::COND_wait_status
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_binlog_thread
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_binlog_thread_end
> wait/synch/cond/sql/MYSQL_BIN_LOG::COND_queue_busy
> wait/synch/cond/sql/MYSQL_BIN_LOG::COND_xid_list
> wait/synch/cond/sql/MYSQL_BIN_LOG::update_cond
>
> === modified file 'mysql-test/suite/perfschema/r/relaylog.result'
> --- a/mysql-test/suite/perfschema/r/relaylog.result 2012-06-22 09:46:28 +0000
> +++ b/mysql-test/suite/perfschema/r/relaylog.result 2012-09-14 12:44:53 +0000
> @@ -56,8 +56,11 @@ where event_name like "%MYSQL_BIN_LOG%"
> and event_name not like "%MYSQL_BIN_LOG::update_cond"
> order by event_name;
> EVENT_NAME COUNT_STAR
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_binlog_thread NONE
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_binlog_thread_end NONE
> wait/synch/cond/sql/MYSQL_BIN_LOG::COND_queue_busy NONE
> wait/synch/cond/sql/MYSQL_BIN_LOG::COND_xid_list NONE
> +wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_binlog_thread MANY
> wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_index MANY
> wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_xid_list MANY
> "Expect no slave relay log"
> @@ -131,8 +134,11 @@ where event_name like "%MYSQL_BIN_LOG%"
> and event_name not like "%MYSQL_BIN_LOG::update_cond"
> order by event_name;
> EVENT_NAME COUNT_STAR
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_binlog_thread MANY
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_binlog_thread_end NONE
> wait/synch/cond/sql/MYSQL_BIN_LOG::COND_queue_busy NONE
> -wait/synch/cond/sql/MYSQL_BIN_LOG::COND_xid_list NONE
> +wait/synch/cond/sql/MYSQL_BIN_LOG::COND_xid_list MANY
> +wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_binlog_thread MANY
> wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_index MANY
> wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_xid_list MANY
> "Expect a slave relay log"
>
> === modified file 'sql/debug_sync.cc'
> --- a/sql/debug_sync.cc 2012-03-28 17:26:00 +0000
> +++ b/sql/debug_sync.cc 2012-09-14 12:44:53 +0000
> @@ -984,6 +984,7 @@ static bool debug_sync_eval_action(THD *
> DBUG_ENTER("debug_sync_eval_action");
> DBUG_ASSERT(thd);
> DBUG_ASSERT(action_str);
> + DBUG_PRINT("debug_sync", ("action_str='%s'", action_str));
>
> /*
> Get debug sync point name. Or a special command.
>
> === modified file 'sql/log.cc'
> --- a/sql/log.cc 2012-09-13 12:31:29 +0000
> +++ b/sql/log.cc 2012-09-14 12:44:53 +0000
> @@ -53,6 +53,7 @@
> #include "rpl_handler.h"
> #include "debug_sync.h"
> #include "sql_show.h"
> +#include "my_pthread.h"
>
> /* max size of the log message */
> #define MAX_LOG_BUFFER_SIZE 1024
> @@ -106,6 +107,14 @@ static SHOW_VAR binlog_status_vars_detai
> {NullS, NullS, SHOW_LONG}
> };
>
> +/* Variables for the binlog background thread. */
> +static bool binlog_thread_started= false;
> +static bool binlog_background_thread_stop= false;
> +static MYSQL_BIN_LOG::xid_count_per_binlog *
> + binlog_background_thread_queue= NULL;
> +
> +static bool start_binlog_background_thread();
> +
>
> /**
> purge logs, master and slave sides both, related error code
> @@ -2957,12 +2966,27 @@ void MYSQL_BIN_LOG::cleanup()
> my_free(b);
> }
>
> + /* Wait for the binlog thread to stop. */
> + if (!is_relay_log && binlog_thread_started)
> + {
> + mysql_mutex_lock(&LOCK_binlog_thread);
> + binlog_background_thread_stop= true;
> + mysql_cond_signal(&COND_binlog_thread);
> + while (binlog_background_thread_stop)
> + mysql_cond_wait(&COND_binlog_thread_end, &LOCK_binlog_thread);
> + mysql_mutex_unlock(&LOCK_binlog_thread);
> + binlog_thread_started= false;
> + }
> +
> mysql_mutex_destroy(&LOCK_log);
> mysql_mutex_destroy(&LOCK_index);
> mysql_mutex_destroy(&LOCK_xid_list);
> + mysql_mutex_destroy(&LOCK_binlog_thread);
> mysql_cond_destroy(&update_cond);
> mysql_cond_destroy(&COND_queue_busy);
> mysql_cond_destroy(&COND_xid_list);
> + mysql_cond_destroy(&COND_binlog_thread);
> + mysql_cond_destroy(&COND_binlog_thread_end);
> }
> DBUG_VOID_RETURN;
> }
> @@ -2988,6 +3012,11 @@ void MYSQL_BIN_LOG::init_pthread_objects
> mysql_cond_init(m_key_update_cond, &update_cond, 0);
> mysql_cond_init(m_key_COND_queue_busy, &COND_queue_busy, 0);
> mysql_cond_init(key_BINLOG_COND_xid_list, &COND_xid_list, 0);
> +
> + mysql_mutex_init(key_BINLOG_LOCK_binlog_thread,
> + &LOCK_binlog_thread, MY_MUTEX_INIT_FAST);
> + mysql_cond_init(key_BINLOG_COND_binlog_thread, &COND_binlog_thread, 0);
> + mysql_cond_init(key_BINLOG_COND_binlog_thread_end, &COND_binlog_thread_end, 0);
> }
>
>
> @@ -3085,6 +3114,10 @@ bool MYSQL_BIN_LOG::open(const char *log
> DBUG_ENTER("MYSQL_BIN_LOG::open");
> DBUG_PRINT("enter",("log_type: %d",(int) log_type_arg));
>
> + if (!is_relay_log && !binlog_thread_started &&
> + start_binlog_background_thread())
> + DBUG_RETURN(1);
> +
> if (init_and_set_log_file_name(log_name, new_name, log_type_arg,
> io_cache_type_arg))
> {
> @@ -5540,11 +5573,7 @@ bool general_log_write(THD *thd, enum en
> }
>
>
> -/*
> - I would like to make this function static, but this causes compiler warnings
> - when it is declared as friend function in log.h.
> -*/
> -void
> +static void
> binlog_checkpoint_callback(void *cookie)
> {
> MYSQL_BIN_LOG::xid_count_per_binlog *entry=
> @@ -8116,9 +8145,128 @@ int TC_LOG_BINLOG::unlog(ulong cookie, m
> void
> TC_LOG_BINLOG::commit_checkpoint_notify(void *cookie)
> {
> - mark_xid_done(((xid_count_per_binlog *)cookie)->binlog_id, true);
> + xid_count_per_binlog *entry= static_cast<xid_count_per_binlog *>(cookie);
> + mysql_mutex_lock(&LOCK_binlog_thread);
> + entry->next_in_queue= binlog_background_thread_queue;
> + binlog_background_thread_queue= entry;
> + mysql_cond_signal(&COND_binlog_thread);
> + mysql_mutex_unlock(&LOCK_binlog_thread);
> }
>
> +/*
> + Binlog service thread.
> +
> + This thread is used to log binlog checkpoints in the background, rather than
> + in the context of random storage engine threads that happen to call
> + commit_checkpoint_notify_ha() and may not like the delays while syncing
> + binlog to disk or may not be setup with all my_thread_init() and other
> + necessary stuff.
> +
> + In the future, this thread could also be used to do log rotation in the
> + background, which could elimiate all stalls around binlog rotations.
> +*/
> +pthread_handler_t
> +binlog_background_thread(void *arg __attribute__((unused)))
> +{
> + bool stop;
> + MYSQL_BIN_LOG::xid_count_per_binlog *queue, *next;
> + THD *thd;
> +
> + my_thread_init();
> + thd= new THD;
> + thd->system_thread= SYSTEM_THREAD_BINLOG_BACKGROUND;
> + my_pthread_setspecific_ptr(THR_THD, thd);
> + mysql_mutex_lock(&LOCK_thread_count);
> + thd->thread_id= thread_id++;
> + mysql_mutex_unlock(&LOCK_thread_count);
> +
> + for (;;)
> + {
> + /*
> + Wait until there is something in the queue to process, or we are asked
> + to shut down.
> + */
> + thd_proc_info(thd, "Waiting for background binlog tasks");
> + mysql_mutex_lock(&mysql_bin_log.LOCK_binlog_thread);
> + for (;;)
> + {
> + stop= binlog_background_thread_stop;
> + queue= binlog_background_thread_queue;
> + if (stop || queue)
> + break;
> + mysql_cond_wait(&mysql_bin_log.COND_binlog_thread,
> + &mysql_bin_log.LOCK_binlog_thread);
> + }
> + /* Grab the queue, if any. */
> + binlog_background_thread_queue= NULL;
> + mysql_mutex_unlock(&mysql_bin_log.LOCK_binlog_thread);
> +
> + /* Process any incoming commit_checkpoint_notify() calls. */
> + while (queue)
> + {
> + thd_proc_info(thd, "Processing binlog checkpoint notification");
> + /* Grab next pointer first, as mark_xid_done() may free the element. */
> + next= queue->next_in_queue;
> + mysql_bin_log.mark_xid_done(queue->binlog_id, true);
> + queue= next;
> +
> + DBUG_EXECUTE_IF("binlog_background_checkpoint_processed",
> + DBUG_ASSERT(!debug_sync_set_action(
> + thd,
> + STRING_WITH_LEN("now SIGNAL binlog_background_checkpoint_processed")));
> + );
> + }
> +
> + if (stop)
> + break;
> + }
> +
> + thd_proc_info(thd, "Stopping binlog background thread");
> +
> + mysql_mutex_lock(&LOCK_thread_count);
> + delete thd;
> + mysql_mutex_unlock(&LOCK_thread_count);
> +
> + my_thread_end();
> +
> + /* Signal that we are (almost) stopped. */
> + mysql_mutex_lock(&mysql_bin_log.LOCK_binlog_thread);
> + binlog_background_thread_stop= false;
> + mysql_cond_signal(&mysql_bin_log.COND_binlog_thread_end);
> + mysql_mutex_unlock(&mysql_bin_log.LOCK_binlog_thread);
> +
> + return 0;
> +}
> +
> +#ifdef HAVE_PSI_INTERFACE
> +static PSI_thread_key key_thread_binlog;
> +
> +static PSI_thread_info all_binlog_threads[]=
> +{
> + { &key_thread_binlog, "binlog_background", PSI_FLAG_GLOBAL},
> +};
> +#endif /* HAVE_PSI_INTERFACE */
> +
> +static bool
> +start_binlog_background_thread()
> +{
> + pthread_t th;
> +
> +#ifdef HAVE_PSI_INTERFACE
> + if (PSI_server)
> + PSI_server->register_thread("sql", all_binlog_threads,
> + array_elements(all_binlog_threads));
> +#endif
> +
> + if (mysql_thread_create(key_thread_binlog, &th, NULL,
> + binlog_background_thread, NULL))
> + return 1;
> +
> + binlog_thread_started= true;
> + return 0;
> +}
> +
> +
> int TC_LOG_BINLOG::recover(LOG_INFO *linfo, const char *last_log_name,
> IO_CACHE *first_log,
> Format_description_log_event *fdle)
>
> === modified file 'sql/log.h'
> --- a/sql/log.h 2012-09-13 12:31:29 +0000
> +++ b/sql/log.h 2012-09-14 12:44:53 +0000
> @@ -395,8 +395,6 @@ class MYSQL_QUERY_LOG: public MYSQL_LOG
> #define BINLOG_COOKIE_IS_DUMMY(c) \
> ( ((ulong)(c)>>1) == BINLOG_COOKIE_DUMMY_ID )
>
> -void binlog_checkpoint_callback(void *cookie);
> -
> class binlog_cache_mngr;
> class MYSQL_BIN_LOG: public TC_LOG, private MYSQL_LOG
> {
> @@ -451,27 +449,6 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
> };
>
> /*
> - A list of struct xid_count_per_binlog is used to keep track of how many
> - XIDs are in prepared, but not committed, state in each binlog. And how
> - many commit_checkpoint_request()'s are pending.
> -
> - When count drops to zero in a binlog after rotation, it means that there
> - are no more XIDs in prepared state, so that binlog is no longer needed
> - for XA crash recovery, and we can log a new binlog checkpoint event.
> -
> - The list is protected against simultaneous access from multiple
> - threads by LOCK_xid_list.
> - */
> - struct xid_count_per_binlog : public ilink {
> - char *binlog_name;
> - uint binlog_name_len;
> - ulong binlog_id;
> - /* Total prepared XIDs and pending checkpoint requests in this binlog. */
> - long xid_count;
> - xid_count_per_binlog(); /* Give link error if constructor used. */
> - };
> - I_List<xid_count_per_binlog> binlog_xid_count_list;
> - /*
> When this is set, a RESET MASTER is in progress.
>
> Then we should not write any binlog checkpoints into the binlog (that
> @@ -480,7 +457,6 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
> checkpoint arrives - when all have arrived, RESET MASTER will complete.
> */
> bool reset_master_pending;
> - friend void binlog_checkpoint_callback(void *cookie);
>
> /* LOCK_log and LOCK_index are inited by init_pthread_objects() */
> mysql_mutex_t LOCK_index;
> @@ -553,10 +529,35 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
> int write_transaction_or_stmt(group_commit_entry *entry);
> bool write_transaction_to_binlog_events(group_commit_entry *entry);
> void trx_group_commit_leader(group_commit_entry *leader);
> - void mark_xid_done(ulong cookie, bool write_checkpoint);
> - void mark_xids_active(ulong cookie, uint xid_count);
>
> public:
> + /*
> + A list of struct xid_count_per_binlog is used to keep track of how many
> + XIDs are in prepared, but not committed, state in each binlog. And how
> + many commit_checkpoint_request()'s are pending.
> +
> + When count drops to zero in a binlog after rotation, it means that there
> + are no more XIDs in prepared state, so that binlog is no longer needed
> + for XA crash recovery, and we can log a new binlog checkpoint event.
> +
> + The list is protected against simultaneous access from multiple
> + threads by LOCK_xid_list.
> + */
> + struct xid_count_per_binlog : public ilink {
> + char *binlog_name;
> + uint binlog_name_len;
> + ulong binlog_id;
> + /* Total prepared XIDs and pending checkpoint requests in this binlog. */
> + long xid_count;
> + /* For linking in requests to the binlog background thread. */
> + xid_count_per_binlog *next_in_queue;
> + xid_count_per_binlog(); /* Give link error if constructor used. */
> + };
> + I_List<xid_count_per_binlog> binlog_xid_count_list;
> + mysql_mutex_t LOCK_binlog_thread;
> + mysql_cond_t COND_binlog_thread;
> + mysql_cond_t COND_binlog_thread_end;
> +
> using MYSQL_LOG::generate_name;
> using MYSQL_LOG::is_open;
>
> @@ -712,6 +713,8 @@ class MYSQL_BIN_LOG: public TC_LOG, priv
> bool appendv(const char* buf,uint len,...);
> bool append(Log_event* ev);
>
> + void mark_xids_active(ulong cookie, uint xid_count);
> + void mark_xid_done(ulong cookie, bool write_checkpoint);
> void make_log_name(char* buf, const char* log_ident);
> bool is_active(const char* log_file_name);
> bool can_purge_log(const char *log_file_name);
>
> === modified file 'sql/mysqld.cc'
> --- a/sql/mysqld.cc 2012-09-13 12:31:29 +0000
> +++ b/sql/mysqld.cc 2012-09-14 12:44:53 +0000
> @@ -724,6 +724,7 @@ PSI_mutex_key key_LOCK_des_key_file;
> #endif /* HAVE_OPENSSL */
>
> PSI_mutex_key key_BINLOG_LOCK_index, key_BINLOG_LOCK_xid_list,
> + key_BINLOG_LOCK_binlog_thread,
> key_delayed_insert_mutex, key_hash_filo_lock, key_LOCK_active_mi,
> key_LOCK_connection_count, key_LOCK_crypt, key_LOCK_delayed_create,
> key_LOCK_delayed_insert, key_LOCK_delayed_status, key_LOCK_error_log,
> @@ -766,6 +767,7 @@ static PSI_mutex_info all_server_mutexes
>
> { &key_BINLOG_LOCK_index, "MYSQL_BIN_LOG::LOCK_index", 0},
> { &key_BINLOG_LOCK_xid_list, "MYSQL_BIN_LOG::LOCK_xid_list", 0},
> + { &key_BINLOG_LOCK_binlog_thread, "MYSQL_BIN_LOG::LOCK_binlog_thread", 0},
> { &key_RELAYLOG_LOCK_index, "MYSQL_RELAY_LOG::LOCK_index", 0},
> { &key_delayed_insert_mutex, "Delayed_insert::mutex", 0},
> { &key_hash_filo_lock, "hash_filo::lock", 0},
> @@ -834,6 +836,7 @@ PSI_cond_key key_PAGE_cond, key_COND_act
> #endif /* HAVE_MMAP */
>
> PSI_cond_key key_BINLOG_COND_xid_list, key_BINLOG_update_cond,
> + key_BINLOG_COND_binlog_thread, key_BINLOG_COND_binlog_thread_end,
> key_COND_cache_status_changed, key_COND_manager,
> key_COND_rpl_status, key_COND_server_started,
> key_delayed_insert_cond, key_delayed_insert_cond_client,
> @@ -863,6 +866,8 @@ static PSI_cond_info all_server_conds[]=
> #endif /* HAVE_MMAP */
> { &key_BINLOG_COND_xid_list, "MYSQL_BIN_LOG::COND_xid_list", 0},
> { &key_BINLOG_update_cond, "MYSQL_BIN_LOG::update_cond", 0},
> + { &key_BINLOG_COND_binlog_thread, "MYSQL_BIN_LOG::COND_binlog_thread", 0},
> + { &key_BINLOG_COND_binlog_thread_end, "MYSQL_BIN_LOG::COND_binlog_thread_end", 0},
> { &key_BINLOG_COND_queue_busy, "MYSQL_BIN_LOG::COND_queue_busy", 0},
> { &key_RELAYLOG_update_cond, "MYSQL_RELAY_LOG::update_cond", 0},
> { &key_RELAYLOG_COND_queue_busy, "MYSQL_RELAY_LOG::COND_queue_busy", 0},
>
> === modified file 'sql/mysqld.h'
> --- a/sql/mysqld.h 2012-09-13 12:31:29 +0000
> +++ b/sql/mysqld.h 2012-09-14 12:44:53 +0000
> @@ -226,6 +226,7 @@ extern PSI_mutex_key key_LOCK_des_key_fi
> #endif
>
> extern PSI_mutex_key key_BINLOG_LOCK_index, key_BINLOG_LOCK_xid_list,
> + key_BINLOG_LOCK_binlog_thread,
> key_delayed_insert_mutex, key_hash_filo_lock, key_LOCK_active_mi,
> key_LOCK_connection_count, key_LOCK_crypt, key_LOCK_delayed_create,
> key_LOCK_delayed_insert, key_LOCK_delayed_status, key_LOCK_error_log,
> @@ -257,6 +258,7 @@ extern PSI_cond_key key_PAGE_cond, key_C
> #endif /* HAVE_MMAP */
>
> extern PSI_cond_key key_BINLOG_COND_xid_list, key_BINLOG_update_cond,
> + key_BINLOG_COND_binlog_thread, key_BINLOG_COND_binlog_thread_end,
> key_COND_cache_status_changed, key_COND_manager,
> key_COND_rpl_status, key_COND_server_started,
> key_delayed_insert_cond, key_delayed_insert_cond_client,
>
> === modified file 'sql/rpl_rli.cc'
> --- a/sql/rpl_rli.cc 2012-09-13 12:31:29 +0000
> +++ b/sql/rpl_rli.cc 2012-09-14 12:44:53 +0000
> @@ -58,6 +58,7 @@ Relay_log_info::Relay_log_info(bool is_s
> {
> DBUG_ENTER("Relay_log_info::Relay_log_info");
>
> + relay_log.is_relay_log= TRUE;
> #ifdef HAVE_PSI_INTERFACE
> relay_log.set_psi_keys(key_RELAYLOG_LOCK_index,
> key_RELAYLOG_update_cond,
> @@ -206,8 +207,6 @@ a file name for --relay-log-index option
> name_warning_sent= 1;
> }
>
> - rli->relay_log.is_relay_log= TRUE;
> -
> /*
> note, that if open() fails, we'll still have index file open
> but a destructor will take care of that
>
> === modified file 'sql/sql_class.h'
> --- a/sql/sql_class.h 2012-09-13 12:31:29 +0000
> +++ b/sql/sql_class.h 2012-09-14 12:44:53 +0000
> @@ -1244,7 +1244,8 @@ enum enum_thread_type
> SYSTEM_THREAD_SLAVE_SQL= 4,
> SYSTEM_THREAD_NDBCLUSTER_BINLOG= 8,
> SYSTEM_THREAD_EVENT_SCHEDULER= 16,
> - SYSTEM_THREAD_EVENT_WORKER= 32
> + SYSTEM_THREAD_EVENT_WORKER= 32,
> + SYSTEM_THREAD_BINLOG_BACKGROUND= 64
> };
>
> inline char const *
>
> === modified file 'storage/innobase/handler/ha_innodb.cc'
> --- a/storage/innobase/handler/ha_innodb.cc 2012-09-13 12:31:29 +0000
> +++ b/storage/innobase/handler/ha_innodb.cc 2012-09-14 12:44:53 +0000
> @@ -106,6 +106,7 @@ static ulong commit_threads = 0;
> static mysql_mutex_t commit_threads_m;
> static mysql_cond_t commit_cond;
> static mysql_mutex_t commit_cond_m;
> +static mysql_mutex_t pending_checkpoint_mutex;
> static bool innodb_inited = 0;
>
> #define INSIDE_HA_INNOBASE_CC
> @@ -222,11 +223,13 @@ static mysql_pfs_key_t innobase_share_mu
> static mysql_pfs_key_t commit_threads_m_key;
> static mysql_pfs_key_t commit_cond_mutex_key;
> static mysql_pfs_key_t commit_cond_key;
> +static mysql_pfs_key_t pending_checkpoint_mutex_key;
>
> static PSI_mutex_info all_pthread_mutexes[] = {
> {&commit_threads_m_key, "commit_threads_m", 0},
> {&commit_cond_mutex_key, "commit_cond_mutex", 0},
> - {&innobase_share_mutex_key, "innobase_share_mutex", 0}
> + {&innobase_share_mutex_key, "innobase_share_mutex", 0},
> + {&pending_checkpoint_mutex_key, "pending_checkpoint_mutex", 0}
> };
>
> static PSI_cond_info all_innodb_conds[] = {
> @@ -2592,6 +2595,9 @@ innobase_init(
> mysql_mutex_init(commit_cond_mutex_key,
> &commit_cond_m, MY_MUTEX_INIT_FAST);
> mysql_cond_init(commit_cond_key, &commit_cond, NULL);
> + mysql_mutex_init(pending_checkpoint_mutex_key,
> + &pending_checkpoint_mutex,
> + MY_MUTEX_INIT_FAST);
> innodb_inited= 1;
> #ifdef MYSQL_DYNAMIC_PLUGIN
> if (innobase_hton != p) {
> @@ -2639,6 +2645,7 @@ innobase_end(
> mysql_mutex_destroy(&commit_threads_m);
> mysql_mutex_destroy(&commit_cond_m);
> mysql_cond_destroy(&commit_cond);
> + mysql_mutex_destroy(&pending_checkpoint_mutex);
> }
>
> DBUG_RETURN(err);
> @@ -3008,6 +3015,16 @@ innobase_rollback_trx(
> DBUG_RETURN(convert_error_code_to_mysql(error, 0, NULL));
> }
>
> +
> +struct pending_checkpoint {
> + struct pending_checkpoint *next;
> + handlerton *hton;
> + void *cookie;
> + ib_uint64_t lsn;
> +};
> +static struct pending_checkpoint *pending_checkpoint_list;
> +static struct pending_checkpoint *pending_checkpoint_list_end;
> +
> /*****************************************************************//**
> Handle a commit checkpoint request from server layer.
> We simply flush the redo log immediately and do the notify call.*/
> @@ -3017,8 +3034,113 @@ innobase_checkpoint_request(
> handlerton *hton,
> void *cookie)
> {
> - log_buffer_flush_to_disk();
> - commit_checkpoint_notify_ha(hton, cookie);
> + ib_uint64_t lsn;
> + ib_uint64_t flush_lsn;
> + struct pending_checkpoint * entry;
> +
> + /* Do the allocation outside of lock to reduce contention. The normal
> + case is that not everything is flushed, so we will need to enqueue. */
> + entry = static_cast<struct pending_checkpoint *>
> + (my_malloc(sizeof(*entry), MYF(MY_WME)));
> + if (!entry) {
> + sql_print_error("Failed to allocate %u bytes."
> + " Commit checkpoint will be skipped.",
> + static_cast<unsigned>(sizeof(*entry)));
> + return;
> + }
> +
> + entry->next = NULL;
> + entry->hton = hton;
> + entry->cookie = cookie;
> +
> + mysql_mutex_lock(&pending_checkpoint_mutex);
> + lsn = log_get_lsn();
> + flush_lsn = log_get_flush_lsn();
> + if (lsn > flush_lsn) {
> + /* Put the request in queue.
> + When the log gets flushed past the lsn, we will remove the
> + entry from the queue and notify the upper layer. */
> + entry->lsn = lsn;
> + if (pending_checkpoint_list_end) {
> + pending_checkpoint_list_end->next = entry;
> + } else {
> + pending_checkpoint_list = entry;
> + }
> + pending_checkpoint_list_end = entry;
> + entry = NULL;
> + }
> + mysql_mutex_unlock(&pending_checkpoint_mutex);
> +
> + if (entry) {
> + /* We are already flushed. Notify the checkpoint immediately. */
> + commit_checkpoint_notify_ha(entry->hton, entry->cookie);
> + my_free(entry);
> + }
> +}
> +
> +/*****************************************************************//**
> +Log code calls this whenever log has been written and/or flushed up
> +to a new position. We use this to notify upper layer of a new commit
> +checkpoint when necessary.*/
> +extern "C" UNIV_INTERN
> +void
> +innobase_mysql_log_notify(
> +/*===============*/
> + ib_uint64_t write_lsn, /*!< in: LSN written to log file */
> + ib_uint64_t flush_lsn) /*!< in: LSN flushed to disk */
> +{
> + struct pending_checkpoint * pending;
> + struct pending_checkpoint * entry;
> + struct pending_checkpoint * last_ready;
> +
> + /* It is safe to do a quick check for NULL first without lock.
> + Even if we should race, we will at most skip one checkpoint and
> + take the next one, which is harmless. */
> + if (!pending_checkpoint_list)
> + return;
> +
> + mysql_mutex_lock(&pending_checkpoint_mutex);
> + pending = pending_checkpoint_list;
> + if (!pending)
> + {
> + mysql_mutex_unlock(&pending_checkpoint_mutex);
> + return;
> + }
> +
> + last_ready = NULL;
> + for (entry = pending; entry != NULL; entry = entry -> next)
> + {
> + if (entry->lsn > flush_lsn)
> + break;
> + last_ready = entry;
> + }
> +
> + if (last_ready)
> + {
> + /* We found some pending checkpoints that are now flushed to
> + disk. So remove them from the list. */
> + pending_checkpoint_list = entry;
> + if (!entry)
> + pending_checkpoint_list_end = NULL;
> + }
> +
> + mysql_mutex_unlock(&pending_checkpoint_mutex);
> +
> + if (!last_ready)
> + return;
> +
> + /* Now that we have released the lock, notify upper layer about all
> + commit checkpoints that have now completed. */
> + for (;;) {
> + entry = pending;
> + pending = pending->next;
> +
> + commit_checkpoint_notify_ha(entry->hton, entry->cookie);
> +
> + my_free(entry);
> + if (entry == last_ready)
> + break;
> + }
> }
>
> /*****************************************************************//**
>
> === modified file 'storage/innobase/include/ha_prototypes.h'
> --- a/storage/innobase/include/ha_prototypes.h 2011-04-26 17:55:52 +0000
> +++ b/storage/innobase/include/ha_prototypes.h 2012-09-14 12:44:53 +0000
> @@ -136,6 +136,17 @@ innobase_mysql_print_thd(
> uint max_query_len); /*!< in: max query length to print, or 0 to
> use the default max length */
>
> +/*****************************************************************//**
> +Log code calls this whenever log has been written and/or flushed up
> +to a new position. We use this to notify upper layer of a new commit
> +checkpoint when necessary.*/
> +UNIV_INTERN
> +void
> +innobase_mysql_log_notify(
> +/*===============*/
> + ib_uint64_t write_lsn, /*!< in: LSN written to log file */
> + ib_uint64_t flush_lsn); /*!< in: LSN flushed to disk */
> +
> /**************************************************************//**
> Converts a MySQL type to an InnoDB type. Note that this function returns
> the 'mtype' of InnoDB. InnoDB differentiates between MySQL's old <= 4.1
>
> === modified file 'storage/innobase/include/log0log.h'
> --- a/storage/innobase/include/log0log.h 2012-06-07 13:44:26 +0000
> +++ b/storage/innobase/include/log0log.h 2012-09-14 12:44:53 +0000
> @@ -151,6 +151,13 @@ UNIV_INLINE
> ib_uint64_t
> log_get_lsn(void);
> /*=============*/
> +/************************************************************//**
> +Gets the last lsn that is fully flushed to disk.
> +@return last flushed lsn */
> +UNIV_INLINE
> +ib_uint64_t
> +log_get_flush_lsn(void);
> +/*=============*/
> /****************************************************************
> Gets the log group capacity. It is OK to read the value without
> holding log_sys->mutex because it is constant.
>
> === modified file 'storage/innobase/include/log0log.ic'
> --- a/storage/innobase/include/log0log.ic 2011-04-05 07:18:43 +0000
> +++ b/storage/innobase/include/log0log.ic 2012-09-14 12:44:53 +0000
> @@ -411,6 +411,25 @@ log_get_lsn(void)
> return(lsn);
> }
>
> +/************************************************************//**
> +Gets the last lsn that is fully flushed to disk.
> +@return last flushed lsn */
> +UNIV_INLINE
> +ib_uint64_t
> +log_get_flush_lsn(void)
> +/*=============*/
> +{
> + ib_uint64_t lsn;
> +
> + mutex_enter(&(log_sys->mutex));
> +
> + lsn = log_sys->flushed_to_disk_lsn;
> +
> + mutex_exit(&(log_sys->mutex));
> +
> + return(lsn);
> +}
> +
> /****************************************************************
> Gets the log group capacity. It is OK to read the value without
> holding log_sys->mutex because it is constant.
>
> === modified file 'storage/innobase/log/log0log.c'
> --- a/storage/innobase/log/log0log.c 2012-03-21 03:48:12 +0000
> +++ b/storage/innobase/log/log0log.c 2012-09-14 12:44:53 +0000
> @@ -1353,6 +1353,8 @@ log_write_up_to(
> ulint loop_count = 0;
> #endif /* UNIV_DEBUG */
> ulint unlock;
> + ib_uint64_t write_lsn;
> + ib_uint64_t flush_lsn;
>
> if (recv_no_ibuf_operations) {
> /* Recovery is running and no operations on the log files are
> @@ -1530,8 +1532,13 @@ log_write_up_to(
>
> log_flush_do_unlocks(unlock);
>
> + write_lsn = log_sys->write_lsn;
> + flush_lsn = log_sys->flushed_to_disk_lsn;
> +
> mutex_exit(&(log_sys->mutex));
>
> + innobase_mysql_log_notify(write_lsn, flush_lsn);
> +
> return;
>
> do_waits:
>
> === modified file 'storage/xtradb/handler/ha_innodb.cc'
> --- a/storage/xtradb/handler/ha_innodb.cc 2012-09-13 12:31:29 +0000
> +++ b/storage/xtradb/handler/ha_innodb.cc 2012-09-14 12:44:53 +0000
> @@ -120,6 +120,7 @@ static ulong commit_threads = 0;
> static mysql_mutex_t commit_threads_m;
> static mysql_cond_t commit_cond;
> static mysql_mutex_t commit_cond_m;
> +static mysql_mutex_t pending_checkpoint_mutex;
> static bool innodb_inited = 0;
>
>
> @@ -253,11 +254,13 @@ static mysql_pfs_key_t innobase_share_mu
> static mysql_pfs_key_t commit_threads_m_key;
> static mysql_pfs_key_t commit_cond_mutex_key;
> static mysql_pfs_key_t commit_cond_key;
> +static mysql_pfs_key_t pending_checkpoint_mutex_key;
>
> static PSI_mutex_info all_pthread_mutexes[] = {
> {&commit_threads_m_key, "commit_threads_m", 0},
> {&commit_cond_mutex_key, "commit_cond_mutex", 0},
> - {&innobase_share_mutex_key, "innobase_share_mutex", 0}
> + {&innobase_share_mutex_key, "innobase_share_mutex", 0},
> + {&pending_checkpoint_mutex_key, "pending_checkpoint_mutex", 0}
> };
>
> static PSI_cond_info all_innodb_conds[] = {
> @@ -3060,6 +3063,9 @@ innobase_init(
> mysql_mutex_init(commit_cond_mutex_key,
> &commit_cond_m, MY_MUTEX_INIT_FAST);
> mysql_cond_init(commit_cond_key, &commit_cond, NULL);
> + mysql_mutex_init(pending_checkpoint_mutex_key,
> + &pending_checkpoint_mutex,
> + MY_MUTEX_INIT_FAST);
> innodb_inited= 1;
> #ifdef MYSQL_DYNAMIC_PLUGIN
> if (innobase_hton != p) {
> @@ -3107,6 +3113,7 @@ innobase_end(
> mysql_mutex_destroy(&commit_threads_m);
> mysql_mutex_destroy(&commit_cond_m);
> mysql_cond_destroy(&commit_cond);
> + mysql_mutex_destroy(&pending_checkpoint_mutex);
> }
>
> DBUG_RETURN(err);
> @@ -3500,6 +3507,16 @@ innobase_rollback_trx(
> DBUG_RETURN(convert_error_code_to_mysql(error, 0, NULL));
> }
>
> +
> +struct pending_checkpoint {
> + struct pending_checkpoint *next;
> + handlerton *hton;
> + void *cookie;
> + ib_uint64_t lsn;
> +};
> +static struct pending_checkpoint *pending_checkpoint_list;
> +static struct pending_checkpoint *pending_checkpoint_list_end;
> +
> /*****************************************************************//**
> Handle a commit checkpoint request from server layer.
> We simply flush the redo log immediately and do the notify call.*/
> @@ -3509,8 +3526,113 @@ innobase_checkpoint_request(
> handlerton *hton,
> void *cookie)
> {
> - log_buffer_flush_to_disk();
> - commit_checkpoint_notify_ha(hton, cookie);
> + ib_uint64_t lsn;
> + ib_uint64_t flush_lsn;
> + struct pending_checkpoint * entry;
> +
> + /* Do the allocation outside of lock to reduce contention. The normal
> + case is that not everything is flushed, so we will need to enqueue. */
> + entry = static_cast<struct pending_checkpoint *>
> + (my_malloc(sizeof(*entry), MYF(MY_WME)));
> + if (!entry) {
> + sql_print_error("Failed to allocate %u bytes."
> + " Commit checkpoint will be skipped.",
> + static_cast<unsigned>(sizeof(*entry)));
> + return;
> + }
> +
> + entry->next = NULL;
> + entry->hton = hton;
> + entry->cookie = cookie;
> +
> + mysql_mutex_lock(&pending_checkpoint_mutex);
> + lsn = log_get_lsn();
> + flush_lsn = log_get_flush_lsn();
> + if (lsn > flush_lsn) {
> + /* Put the request in queue.
> + When the log gets flushed past the lsn, we will remove the
> + entry from the queue and notify the upper layer. */
> + entry->lsn = lsn;
> + if (pending_checkpoint_list_end) {
> + pending_checkpoint_list_end->next = entry;
> + } else {
> + pending_checkpoint_list = entry;
> + }
> + pending_checkpoint_list_end = entry;
> + entry = NULL;
> + }
> + mysql_mutex_unlock(&pending_checkpoint_mutex);
> +
> + if (entry) {
> + /* We are already flushed. Notify the checkpoint immediately. */
> + commit_checkpoint_notify_ha(entry->hton, entry->cookie);
> + my_free(entry);
> + }
> +}
> +
> +/*****************************************************************//**
> +Log code calls this whenever log has been written and/or flushed up
> +to a new position. We use this to notify upper layer of a new commit
> +checkpoint when necessary.*/
> +extern "C" UNIV_INTERN
> +void
> +innobase_mysql_log_notify(
> +/*===============*/
> + ib_uint64_t write_lsn, /*!< in: LSN written to log file */
> + ib_uint64_t flush_lsn) /*!< in: LSN flushed to disk */
> +{
> + struct pending_checkpoint * pending;
> + struct pending_checkpoint * entry;
> + struct pending_checkpoint * last_ready;
> +
> + /* It is safe to do a quick check for NULL first without lock.
> + Even if we should race, we will at most skip one checkpoint and
> + take the next one, which is harmless. */
> + if (!pending_checkpoint_list)
> + return;
> +
> + mysql_mutex_lock(&pending_checkpoint_mutex);
> + pending = pending_checkpoint_list;
> + if (!pending)
> + {
> + mysql_mutex_unlock(&pending_checkpoint_mutex);
> + return;
> + }
> +
> + last_ready = NULL;
> + for (entry = pending; entry != NULL; entry = entry -> next)
> + {
> + if (entry->lsn > flush_lsn)
> + break;
> + last_ready = entry;
> + }
> +
> + if (last_ready)
> + {
> + /* We found some pending checkpoints that are now flushed to
> + disk. So remove them from the list. */
> + pending_checkpoint_list = entry;
> + if (!entry)
> + pending_checkpoint_list_end = NULL;
> + }
> +
> + mysql_mutex_unlock(&pending_checkpoint_mutex);
> +
> + if (!last_ready)
> + return;
> +
> + /* Now that we have released the lock, notify upper layer about all
> + commit checkpoints that have now completed. */
> + for (;;) {
> + entry = pending;
> + pending = pending->next;
> +
> + commit_checkpoint_notify_ha(entry->hton, entry->cookie);
> +
> + my_free(entry);
> + if (entry == last_ready)
> + break;
> + }
> }
>
> /*****************************************************************//**
>
> === modified file 'storage/xtradb/include/ha_prototypes.h'
> --- a/storage/xtradb/include/ha_prototypes.h 2012-02-21 19:51:56 +0000
> +++ b/storage/xtradb/include/ha_prototypes.h 2012-09-14 12:44:53 +0000
> @@ -136,6 +136,17 @@ innobase_mysql_print_thd(
> uint max_query_len); /*!< in: max query length to print, or 0 to
> use the default max length */
>
> +/*****************************************************************//**
> +Log code calls this whenever log has been written and/or flushed up
> +to a new position. We use this to notify upper layer of a new commit
> +checkpoint when necessary.*/
> +UNIV_INTERN
> +void
> +innobase_mysql_log_notify(
> +/*===============*/
> + ib_uint64_t write_lsn, /*!< in: LSN written to log file */
> + ib_uint64_t flush_lsn); /*!< in: LSN flushed to disk */
> +
> /**************************************************************//**
> Converts a MySQL type to an InnoDB type. Note that this function returns
> the 'mtype' of InnoDB. InnoDB differentiates between MySQL's old <= 4.1
>
> === modified file 'storage/xtradb/include/log0log.h'
> --- a/storage/xtradb/include/log0log.h 2012-08-27 16:13:17 +0000
> +++ b/storage/xtradb/include/log0log.h 2012-09-14 12:44:53 +0000
> @@ -151,6 +151,13 @@ UNIV_INLINE
> ib_uint64_t
> log_get_lsn(void);
> /*=============*/
> +/************************************************************//**
> +Gets the last lsn that is fully flushed to disk.
> +@return last flushed lsn */
> +UNIV_INLINE
> +ib_uint64_t
> +log_get_flush_lsn(void);
> +/*=============*/
> /****************************************************************
> Gets the log group capacity. It is OK to read the value without
> holding log_sys->mutex because it is constant.
>
> === modified file 'storage/xtradb/include/log0log.ic'
> --- a/storage/xtradb/include/log0log.ic 2011-07-14 19:22:41 +0000
> +++ b/storage/xtradb/include/log0log.ic 2012-09-14 12:44:53 +0000
> @@ -411,6 +411,25 @@ log_get_lsn(void)
> return(lsn);
> }
>
> +/************************************************************//**
> +Gets the last lsn that is fully flushed to disk.
> +@return last flushed lsn */
> +UNIV_INLINE
> +ib_uint64_t
> +log_get_flush_lsn(void)
> +/*=============*/
> +{
> + ib_uint64_t lsn;
> +
> + mutex_enter(&(log_sys->mutex));
> +
> + lsn = log_sys->flushed_to_disk_lsn;
> +
> + mutex_exit(&(log_sys->mutex));
> +
> + return(lsn);
> +}
> +
> /****************************************************************
> Gets the log group capacity. It is OK to read the value without
> holding log_sys->mutex because it is constant.
>
> === modified file 'storage/xtradb/log/log0log.c'
> --- a/storage/xtradb/log/log0log.c 2012-08-27 16:13:17 +0000
> +++ b/storage/xtradb/log/log0log.c 2012-09-14 12:44:53 +0000
> @@ -1390,6 +1390,8 @@ log_write_up_to(
> ulint loop_count = 0;
> #endif /* UNIV_DEBUG */
> ulint unlock;
> + ib_uint64_t write_lsn;
> + ib_uint64_t flush_lsn;
>
> if (recv_no_ibuf_operations) {
> /* Recovery is running and no operations on the log files are
> @@ -1568,8 +1570,13 @@ log_write_up_to(
>
> log_flush_do_unlocks(unlock);
>
> + write_lsn = log_sys->write_lsn;
> + flush_lsn = log_sys->flushed_to_disk_lsn;
> +
> mutex_exit(&(log_sys->mutex));
>
> + innobase_mysql_log_notify(write_lsn, flush_lsn);
> +
> return;
>
> do_waits:
>
> _______________________________________________
> commits mailing list
> commits(a)mariadb.org
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
2
2
Is Explain output in Json format planned in MariaDB (refer
http://glebshchepa.blogspot.in/2012/04/optimizer-new-explain-formatjson.html)
?
If so when? and will the implementation details and output format be fully
identical to Oracle/MySQL 5.6?
(I did not find it mentioned here:
http://monty-says.blogspot.dk/2012/09/now-is-good-time-to-be-part-of-future…
http://blog.mariadb.org/mariadb-directions/
.. sorry if I overlooked it somewhere)
-- Peter
-- Webyog
4
8
Re: [Maria-developers] [Commits] Rev 3578: MDEV-495 backport --ignore-db-dir
by Sergei Golubchik 25 Sep '12
by Sergei Golubchik 25 Sep '12
25 Sep '12
Hi, Holyfoot!
On Sep 25, holyfoot(a)askmonty.org wrote:
> revno: 3578
> revision-id: holyfoot(a)askmonty.org-20120925094024-yv0wpbl6k7b6bgg8
> parent: sanja(a)montyprogram.com-20120920094859-lcrxe8lr1mtg8h10
> committer: Alexey Botchkov <holyfoot(a)askmonty.org>
> branch nick: mdev495
> timestamp: Tue 2012-09-25 14:40:24 +0500
> message:
> MDEV-495 backport --ignore-db-dir.
> The feature was backported from MySQL 5.6.
> Some code was added to make commands as
> SELECT * FROM ignored_db.t1;
> CALL ignored_db.proc();
> USE ignored_db;
> to take that option into account.
Just a few comments:
1. add a test case for CREATE DATABASE,
CREATE DATABASE ignored_db should fail.
2. I'd prefer to get an error "db not found" instead of "incorrect db
name", is it difficult to do?
3. add a test for the following: lost+found is ignored, but
CREATE DATABASE `lost+found` should succeed!
(indeed, it creates a directory lost@002bfound, which is not
ignored), and later SELECT, CALL, USE, SHOW DATABASES,
should show that new db.
> === added file 'mysql-test/r/ignore_db_dirs_basic.result'
> --- a/mysql-test/r/ignore_db_dirs_basic.result 1970-01-01 00:00:00 +0000
> +++ b/mysql-test/r/ignore_db_dirs_basic.result 2012-09-25 09:40:24 +0000
> @@ -0,0 +1,28 @@
> +select @@ignore_db_dirs;
> +@@ignore_db_dirs
> +e,lost+found,.mysqlgui,ignored_db
> +# Check that SHOW DATABASES ignores all directories from
> +# @@ignore_db_dirs and all directories with names starting
> +# with '.'
> +SHOW DATABASES;
> +Database
> +information_schema
> +#mysql50#.otherdir
> +mtr
> +mysql
> +test
> +USE ignored_db;
> +ERROR 42000: Incorrect database name 'ignored_db'
> +SELECT * FROM ignored_db.t1;
> +ERROR 42000: Incorrect database name 'ignored_db'
> +CALL ignored_db.p1();
> +ERROR 42000: Incorrect database name 'ignored_db'
> +SELECT COUNT(*) FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME='ignored_db';
> +COUNT(*)
> +1
> +SET @@global.ignore_db_dirs = 'aha';
> +ERROR HY000: Variable 'ignore_db_dirs' is a read only variable
> +SET @@local.ignore_db_dirs = 'aha';
> +ERROR HY000: Variable 'ignore_db_dirs' is a read only variable
> +SET @@ignore_db_dirs = 'aha';
> +ERROR HY000: Variable 'ignore_db_dirs' is a read only variable
When merging to 5.5 you'll need to move this test to sys_vars suite
(otherwise sys_vars.all_vars will fail)
> === modified file 'sql/sql_show.cc'
> --- a/sql/sql_show.cc 2012-08-24 13:39:34 +0000
> +++ b/sql/sql_show.cc 2012-09-25 09:40:24 +0000
> @@ -379,6 +379,239 @@ bool mysqld_show_privileges(THD *thd)
> }
>
>
> +/** Hash of LEX_STRINGs used to search for ignored db directories. */
> +static HASH ignore_db_dirs_hash;
> +
> +/**
> + An array of LEX_STRING pointers to collect the options at
> + option parsing time.
> +*/
> +static DYNAMIC_ARRAY ignore_db_dirs_array;
I don't think that sql_show.cc is where all this code really belongs,
but ok, as you like.
Regards,
Sergei
1
0
Hi!
For the record, posting some ha_cassandra benchmarks:
<contents>
Run #1: re-run previous benchmarks
Run #2: evaluate max. network performance
Run #3: Load two dbt3 databases at once
Run #4: load two datasets into different nodes
Results
Conclusions
</contents>
=== Run #1: re-run previous benchmarks ===
My results with
- m1.large mysqld node (mysqld uses optimized binary)
- Cassandra on 2*m1.large nodes
- Loading data with the standard LOAD DATA INFILE command.
- %CPU mysqld =14
sar -n DEV 10:
03:04:41 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
03:04:51 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:04:51 PM eth0 1575.20 156.60 81.47 4458.13 0.00 0.00 0.00
On cassandra node:
- jsvc %CPU 70-170...
02:51:00 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
02:51:05 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:51:05 PM eth0 8325.20 7333.80 3376.25 1696.74 0.00 0.00 0.00
Total load time: time: 19 min.
=== Run #2: evaluate max. network performance ===
for comparison, sending /dev/zero on one host to /dev/null on another with nc:
03:39:24 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
03:39:34 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:39:34 PM eth0 16756.70 3688.50 851.91 87477.61 0.00 0.00 0.00
sending in both directions:
SQL:
03:45:36 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
03:45:46 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:45:46 PM eth0 63308.20 4473.50 73002.80 81484.77 0.00 0.00 0.00
Cassandra:
03:47:32 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
03:47:42 PM lo 0.20 0.20 0.01 0.01 0.00 0.00 0.00
03:47:42 PM eth0 60741.50 15241.70 84110.07 70712.32 0.00 0.00 0.00
=== Run #3: Load two dbt3 databases at once ===
- Loading two DBT-3 databases at once.
- The databases have their own key spaces, and mysql tables
- Still, both loaders connect to the same cassandra host.
Cassandra host:
04:27:57 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
04:28:07 PM eth0 9979.10 8984.00 6006.77 2621.08 0.00 0.00 0.00
04:29:07 PM eth0 9181.80 7315.00 6416.25 2686.83 0.00 0.00 0.00
04:29:17 PM eth0 10277.00 9629.60 6047.00 2672.02 0.00 0.00 0.00
04:30:23 PM CPU %user %nice %system %iowait %steal %idle
04:30:33 PM all 34.56 24.80 24.71 0.18 0.05 15.69
04:30:53 PM all 32.08 11.20 24.25 1.31 0.08 31.07
04:31:03 PM all 35.59 22.55 22.28 0.27 0.00 19.31
it has periodical slow-downs like this:
04:32:38 PM all 53.25 1.30 19.05 3.03 0.00 23.38
04:32:39 PM all 65.84 10.40 21.78 0.00 0.00 1.98
04:32:40 PM all 37.14 33.81 18.10 0.95 0.00 10.00
04:32:41 PM all 42.65 25.98 26.96 0.00 0.00 4.41
04:32:42 PM all 33.17 39.42 19.23 0.00 0.00 8.17
04:32:43 PM all 36.89 36.89 19.90 0.00 0.00 6.31
04:32:44 PM all 39.81 35.55 13.74 0.00 0.00 10.90
04:32:45 PM all 34.45 33.97 22.01 0.00 0.00 9.57
04:32:46 PM all 35.58 36.06 21.15 0.00 0.00 7.21
04:32:47 PM all 31.05 35.16 15.07 0.00 0.00 18.72
04:32:48 PM all 41.31 26.29 20.66 0.00 0.00 11.74
04:32:49 PM all 39.02 32.20 25.37 0.00 0.00 3.41
04:32:50 PM all 26.64 34.50 13.97 0.00 0.00 24.89
04:32:51 PM all 56.93 27.72 12.87 0.00 0.00 2.48
04:32:52 PM all 71.14 15.42 11.44 0.00 0.00 1.99
04:32:53 PM all 64.22 8.82 23.04 0.00 0.00 3.92
04:32:54 PM all 76.73 1.98 19.31 0.00 0.00 1.98
04:32:55 PM all 61.24 0.00 30.14 0.96 0.00 7.66
04:32:56 PM all 29.20 0.00 32.40 0.00 0.00 38.40
04:32:57 PM all 30.29 0.00 29.88 0.00 0.00 39.83
04:32:58 PM all 23.19 0.00 20.29 0.00 0.36 56.16
04:32:59 PM all 23.34 0.00 16.38 0.00 0.35 59.93
04:33:00 PM all 20.07 0.00 21.90 0.00 0.00 58.03
04:33:01 PM all 24.20 0.00 18.86 0.00 0.00 56.94
04:33:02 PM all 20.07 0.00 22.18 0.00 0.70 57.04
Is this a compaction?
mysqld host:
04:29:01 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
04:29:01 PM eth0 1416.20 247.90 75.43 6578.11 0.00 0.00 0.00
04:29:11 PM eth0 1175.00 199.60 62.49 5292.67 0.00 0.00 0.00
04:29:21 PM eth0 1074.60 186.90 57.21 4952.19 0.00 0.00 0.00
04:30:30 PM CPU %user %nice %system %iowait %steal %idle
04:30:40 PM all 7.77 0.00 0.56 0.00 0.05 91.62
04:30:50 PM all 8.20 0.00 0.60 0.00 0.30 90.89
04:31:00 PM all 8.01 0.00 0.66 0.00 0.05 91.28
Conclusions:
- mysqld is still idle most of the time.
- Cassandra is busier. Sometimes, it is 1014/Sep/12 12:50 PM0% busy.
- network use betwen mysqld and cassandra is 2x more, which is expected
Run time: 24 minutes for both (compare to 19 minutes)
=== Run #4: load two datasets into different nodes ===
==== Network ====
SQL node:
05:44:21 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
05:44:41 PM eth0 1178.10 210.10 62.81 5510.03 0.00 0.00 0.00
05:44:51 PM eth0 1062.10 200.90 57.53 5152.63 0.00 0.00 0.00
05:45:01 PM eth0 965.80 194.00 52.53 5012.62 0.00 0.00 0.00
Cassandra1:
05:44:19 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
05:44:49 PM eth0 10159.00 8659.20 4443.26 1798.75 0.00 0.00 0.00
05:44:59 PM eth0 9018.20 7750.10 3982.21 1578.30 0.00 0.00 0.00
05:45:09 PM eth0 8741.30 7521.70 3921.69 1541.55 0.00 0.00 0.00
Cassandra2:
05:44:56 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
05:45:06 PM eth0 9305.60 7906.00 4121.10 1622.92 0.00 0.00 0.00
05:45:16 PM eth0 9186.50 7828.80 4258.12 1681.78 0.00 0.00 0.00
05:45:26 PM eth0 9675.70 8283.10 4306.89 1698.37 0.00 0.00 0.00
==== CPU ====
SQL node:
05:46:45 PM CPU %user %nice %system %iowait %steal %idle
05:46:55 PM all 8.39 0.00 0.70 0.00 0.40 90.51
05:47:05 PM all 7.65 0.00 0.74 0.00 0.70 90.91
05:47:15 PM all 8.76 0.00 0.64 0.00 0.89 89.70
cassandra 1:
05:47:01 PM CPU %user %nice %system %iowait %steal %idle
05:47:11 PM all 27.28 28.24 27.55 0.00 0.05 16.89
05:47:21 PM all 31.45 24.44 27.56 1.38 0.05 15.13
05:47:31 PM all 29.46 19.60 23.19 2.35 0.04 25.36
05:47:41 PM all 38.47 0.04 14.21 0.15 0.07 47.06
05:47:51 PM all 31.04 9.92 17.97 0.39 0.08 40.61
05:48:01 PM all 24.73 28.07 21.77 0.26 0.04 25.12
05:48:11 PM all 19.79 11.67 19.23 1.83 0.07 47.41
05:48:21 PM all 20.29 0.00 11.07 0.06 0.13 68.44
05:48:31 PM all 24.08 0.00 11.96 0.13 0.13 63.69
Cassandra 2:
05:46:56 PM CPU %user %nice %system %iowait %steal %idle
05:47:06 PM all 31.97 27.14 26.91 0.09 0.05 13.85
05:47:16 PM all 32.79 20.43 23.39 2.34 0.04 21.01
05:47:26 PM all 26.43 27.07 27.29 0.14 0.09 18.98
05:47:36 PM all 17.72 8.59 12.00 2.97 0.21 58.52
05:47:46 PM all 13.63 0.00 6.77 0.06 0.30 79.23
05:47:56 PM all 27.19 3.92 8.52 0.11 0.25 60.01
05:48:06 PM all 19.94 27.83 18.47 0.13 0.08 33.54
05:48:16 PM all 13.69 16.01 14.24 2.69 0.15 53.22
05:48:26 PM all 18.13 0.00 9.57 0.25 0.38 71.66
=== Results ===
The first client failed with
ERROR 1928 (HY000) at line 145: Internal error: 'TimedOutException: Default TException.'
real 4m3.463s
user 0m0.008s
sys 0m0.000s
The second completed after:
real 19m57.875s (which matches load time for loading one dataset).
=== Conclusions ===
- need to fix the Timeout error (filed as MDEV-535)
- CPU time spent by mysqld is negligible now (I don't know whether this comes
from using faster CPU for SQL node, or using a release build of mariadb, or
both)
- A confirmation of previous findings: if we use only one connection, the speed
at which we can insert the data is very limited (see: with two connections,
we have inserted 2x data in 20% more time. This is where we used two
connections to connect to the *same* node)
- I am not sure whether I should just start using multiple connections, or also
automatically route the data (both writes and reads) to the right node?
- need to check out Hector client. It seems to be able to connect to
multiple nodes of Cassandra cluster.
BR
Sergei
--
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
2
1
21 Sep '12
Hi!
I have done some extensions to mytop as part of porting, fixing and enhancing
it for MariaDB. I have now merged my changes with 1.91 and thought it was time
to share the changes with you. Sorry for not doing it sooner!
I also have added Jean Weisbuch patches for --full-queries and reading
of .my.cnf
Here is a diff of my changes. Hope you find some of them useful!
If you have any questions, feel free to ask!
Best regards,
Monty
Creator of MySQL and MariaDB
1
0
21 Sep '12
Hello,
I've figured there may be an additional challenge in getting Cassandra
storage engine into a *MariaDB release*. The problem is packaging.
<contents>
Cassandra SE packaging problems
Need Thrift library
Need recent gcc
How did Sphinx SE manage to avoid a similar mess?
Possible way out
</contents>
== Cassandra SE packaging problems ==
=== Need recent gcc ===
C++ code generated by Thrift uses boost, and so one needs a recent gcc
to compile it with MySQL. We have the same issue with oqgraph. See MDEV-501
for details. I am not aware of crashing problems that ograph+boost
combination has.
Still, recent-gcc requirement may limit the range of platforms we can build
for. For example, I never managed to compile on Ubuntu 9.04.
=== Need Thrift library ===
Cassandra SE uses Thrift protocol to talk to server. In order to do this, it
needs to link againist libthrift.so.
There are no Thrift packages in either Fedora or Ubuntu (actually I haven't
seen Thrift packages for any OS). Thrift homepage only offers source tarballs.
The source tarball uses automake and compiles into
1. thrift compiler binary (we dont actually need it. Source code generated by
thrift needs to be modified anyway.)
2. thrift libraries for a dozen of languages, including libthrift.so, which we
do need.
== How did Sphinx SE manage to avoid a similar mess? ==
I've asked myself a question how do we manage to include Sphinx SE without
having to mess with Sphinx packages.
The answer is: Sphinx SE doesn't need any sphinx libraries, either at compile
time, or at runtime. It implements its client protocol fully, except for
INSERT commands, for which it uses the available mysql client library to talk
to Sphinx server using mysql client-server protocol.
== An unlikely but possible way out ==
Cassandra 1.2 (the next version that is promised to be released "in Q4 2012"
will introduce native network protocol. A week ago they've pushed first
documentation draft. The native network protocol should allow to do
everything that Thrift API allows, and more. There are no C++ clients yet,
though.
If I am faced with a choice betwen
1. implementing a relatively simple network protocol
2. Making Thrift packages for various platforms (and potentially convincing
downstream maintainers to accept them)
then I will definitely pick #1 (I have implemented network protocols in the
past).
There is a big risk, though: native protocol support is very new in Cassandra,
they pushed it about a week ago. It was pushed into their "trunk" which is
under development. I don't know if the code in their trunk is sufficiently
stable for us to develop againist.
BR
Sergei
--
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
3
3
Thanks!
At beginning Il'll try to do some benchmarks!
On 13/09/2012 22:11, Michael Widenius wrote:
> Hi!
>
>>>>>> "Sergei" == Sergei Golubchik <serg(a)askmonty.org> writes:
> Sergei> Hi, Giacomo!
> Sergei> On Aug 11, Giacomo Picchiarelli wrote:
>>> Hi,
>>>
>>> I'm new to MariaDB and I'm reading the knowledge base.
>>> I'm interested in studying Aria Engine (and contribute, if I can :) ),
>>> are there tasks for newbie developers?
> Sergei> I'm afraid, there are no tasks in the Aria engine right now :(
>
> Sergei> If you're interested to look at other tasks - here is the list of task,
> Sergei> feel free to pick anything from it:
>
> Sergei> https://mariadb.atlassian.net/secure/IssueNavigator.jspa?requestId=11300
>
> Actually, there are some open bugs on Aria:
>
> https://mariadb.atlassian.net/browse/MDEV-333
>
> You can go to:
> https://mariadb.atlassian.net/secure/QuickSearch.jspa
> And search for "Aria"
>
> Same goes for lanchpad:
>
> https://bugs.launchpad.net/maria?field.searchtext=aria&search=Search&field.…
>
> Note that many of the bugs in launchpad for Aria are already fixed.
> However it would be quite valuable if you could help us go trough
> them and mark which still are open and try to fix them yourself.
>
> Another thing that can be done on Aria is to run some benchmarks on it
> and try to find ways to improve the performance on it.
>
> Regards,
> Michael Widenius
>
--
Giacomo Picchiarelli
Skype: g.picchiarelli
2
1
Re: [Maria-developers] [Commits] Rev 3577: MDEV-521 fix. in file:///home/bell/maria/bzr/work-maria-5.3-MDEV-521/
by Sergei Petrunia 20 Sep '12
by Sergei Petrunia 20 Sep '12
20 Sep '12
Hi Sanja,
Ok to push.
On Thu, Sep 20, 2012 at 12:49:03PM +0300, sanja(a)montyprogram.com wrote:
> At file:///home/bell/maria/bzr/work-maria-5.3-MDEV-521/
>
> ------------------------------------------------------------
> revno: 3577
> revision-id: sanja(a)montyprogram.com-20120920094859-lcrxe8lr1mtg8h10
> parent: timour(a)askmonty.org-20120917081346-nqd08d0v27w8zp5c
> committer: sanja(a)montyprogram.com
> branch nick: work-maria-5.3-MDEV-521
> timestamp: Thu 2012-09-20 12:48:59 +0300
> message:
> MDEV-521 fix.
>
> After pullout item during single row subselect transformation it should be fixed properly.
> === modified file 'mysql-test/r/subselect.result'
> --- a/mysql-test/r/subselect.result 2012-08-25 06:15:57 +0000
> +++ b/mysql-test/r/subselect.result 2012-09-20 09:48:59 +0000
> @@ -6163,5 +6163,20 @@ a ( 3, 3 ) NOT IN ( SELECT NULL, NULL )
> 3 NULL
> set optimizer_switch=@mdev367_optimizer_switch;
> DROP TABLE t1;
> +#
> +# MDEV-521 single value subselect transformation problem
> +#
> +CREATE TABLE t1 (f1 char(2), PRIMARY KEY (f1)) ENGINE=MyISAM;
> +INSERT INTO t1 VALUES ('u1'),('u2');
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +FLUSH TABLES;
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +DROP TABLE t1;
> # return optimizer switch changed in the beginning of this test
> set optimizer_switch=@subselect_tmp;
>
> === modified file 'mysql-test/r/subselect_no_mat.result'
> --- a/mysql-test/r/subselect_no_mat.result 2012-08-25 06:15:57 +0000
> +++ b/mysql-test/r/subselect_no_mat.result 2012-09-20 09:48:59 +0000
> @@ -6162,6 +6162,21 @@ a ( 3, 3 ) NOT IN ( SELECT NULL, NULL )
> 3 NULL
> set optimizer_switch=@mdev367_optimizer_switch;
> DROP TABLE t1;
> +#
> +# MDEV-521 single value subselect transformation problem
> +#
> +CREATE TABLE t1 (f1 char(2), PRIMARY KEY (f1)) ENGINE=MyISAM;
> +INSERT INTO t1 VALUES ('u1'),('u2');
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +FLUSH TABLES;
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +DROP TABLE t1;
> # return optimizer switch changed in the beginning of this test
> set optimizer_switch=@subselect_tmp;
> set optimizer_switch=default;
>
> === modified file 'mysql-test/r/subselect_no_opts.result'
> --- a/mysql-test/r/subselect_no_opts.result 2012-08-25 06:15:57 +0000
> +++ b/mysql-test/r/subselect_no_opts.result 2012-09-20 09:48:59 +0000
> @@ -6158,6 +6158,21 @@ a ( 3, 3 ) NOT IN ( SELECT NULL, NULL )
> 3 NULL
> set optimizer_switch=@mdev367_optimizer_switch;
> DROP TABLE t1;
> +#
> +# MDEV-521 single value subselect transformation problem
> +#
> +CREATE TABLE t1 (f1 char(2), PRIMARY KEY (f1)) ENGINE=MyISAM;
> +INSERT INTO t1 VALUES ('u1'),('u2');
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +FLUSH TABLES;
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +DROP TABLE t1;
> # return optimizer switch changed in the beginning of this test
> set optimizer_switch=@subselect_tmp;
> set @optimizer_switch_for_subselect_test=null;
>
> === modified file 'mysql-test/r/subselect_no_scache.result'
> --- a/mysql-test/r/subselect_no_scache.result 2012-08-25 06:15:57 +0000
> +++ b/mysql-test/r/subselect_no_scache.result 2012-09-20 09:48:59 +0000
> @@ -6169,6 +6169,21 @@ a ( 3, 3 ) NOT IN ( SELECT NULL, NULL )
> 3 NULL
> set optimizer_switch=@mdev367_optimizer_switch;
> DROP TABLE t1;
> +#
> +# MDEV-521 single value subselect transformation problem
> +#
> +CREATE TABLE t1 (f1 char(2), PRIMARY KEY (f1)) ENGINE=MyISAM;
> +INSERT INTO t1 VALUES ('u1'),('u2');
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +FLUSH TABLES;
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +DROP TABLE t1;
> # return optimizer switch changed in the beginning of this test
> set optimizer_switch=@subselect_tmp;
> set optimizer_switch=default;
>
> === modified file 'mysql-test/r/subselect_no_semijoin.result'
> --- a/mysql-test/r/subselect_no_semijoin.result 2012-08-25 06:15:57 +0000
> +++ b/mysql-test/r/subselect_no_semijoin.result 2012-09-20 09:48:59 +0000
> @@ -6158,6 +6158,21 @@ a ( 3, 3 ) NOT IN ( SELECT NULL, NULL )
> 3 NULL
> set optimizer_switch=@mdev367_optimizer_switch;
> DROP TABLE t1;
> +#
> +# MDEV-521 single value subselect transformation problem
> +#
> +CREATE TABLE t1 (f1 char(2), PRIMARY KEY (f1)) ENGINE=MyISAM;
> +INSERT INTO t1 VALUES ('u1'),('u2');
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +FLUSH TABLES;
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +f1
> +u1
> +u2
> +DROP TABLE t1;
> # return optimizer switch changed in the beginning of this test
> set optimizer_switch=@subselect_tmp;
> set @optimizer_switch_for_subselect_test=null;
>
> === modified file 'mysql-test/t/subselect.test'
> --- a/mysql-test/t/subselect.test 2012-08-25 06:15:57 +0000
> +++ b/mysql-test/t/subselect.test 2012-09-20 09:48:59 +0000
> @@ -5229,5 +5229,18 @@ set optimizer_switch=@mdev367_optimizer_
>
> DROP TABLE t1;
>
> +--echo #
> +--echo # MDEV-521 single value subselect transformation problem
> +--echo #
> +CREATE TABLE t1 (f1 char(2), PRIMARY KEY (f1)) ENGINE=MyISAM;
> +INSERT INTO t1 VALUES ('u1'),('u2');
> +
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +FLUSH TABLES;
> +SELECT a.* FROM t1 a WHERE ( SELECT EXISTS ( SELECT 1 FROM t1 b WHERE b.f1 = a.f1 ) );
> +
> +# Cleanup
> +DROP TABLE t1;
> +
> --echo # return optimizer switch changed in the beginning of this test
> set optimizer_switch=@subselect_tmp;
>
> === modified file 'sql/item_subselect.cc'
> --- a/sql/item_subselect.cc 2012-09-17 08:13:46 +0000
> +++ b/sql/item_subselect.cc 2012-09-20 09:48:59 +0000
> @@ -1007,11 +1007,9 @@ Item_singlerow_subselect::select_transfo
> }
> substitution= select_lex->item_list.head();
> /*
> - as far as we moved content to upper level, field which depend of
> - 'upper' select is not really dependent => we remove this dependence
> + as far as we moved content to upper level we have to fix dependences & Co
> */
> - substitution->walk(&Item::remove_dependence_processor, 0,
> - (uchar *) select_lex->outer_select());
> + substitution->fix_after_pullout(select_lex->outer_select(), &substitution);
> }
> DBUG_RETURN(false);
> }
>
> _______________________________________________
> commits mailing list
> commits(a)mariadb.org
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
--
BR
Sergei
--
Sergei Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
0
Robert Hodges <robert.hodges(a)continuent.com> writes:
> I think we need to take this out to one of the email threads. I hope other
> people would be interested in these problems.
Agree, I've quoted liberally and Cc:'ed maria-developers@
> On Aug 31, 2012, at 6:14 AM PDT, Kristian Nielsen wrote:
>> Robert Hodges <robert.hodges(a)continuent.com> writes:
>>
>>> 1.) The server-id of which I was speaking is a very simple patch to allow us
>>> to do the following client command: 'set session server-id=X' where X is
>>> whatever we choose. Currently MySQL replication sets the server ID from an
>>> internal structure (THD as I recall), which means that applications like
>>> Tungsten cannot get the server-id into the binlog when applying updates on a
>>> slave. Server-id is essential to prevent replication loops. Monty and I
>>> discussed this feature briefly over dinner in April. I hacked in support
>>> using a UDF about a year ago but it would be better to have this become a
>>> dynamic session variable.
>> Yes, it does not seem too hard to do (though as always some care is needed to
>> introduce it in a clean way).
>>
>> I created an issue for this with some descriptions:
>>
>> https://mariadb.atlassian.net/browse/MDEV-500
> SET SESSION pseudo_server_id=236;
>
> Then the server_id 236 would to go into the binlog. This seems like a good
> idea and avoids accidentally reseting the server_id, which is a pretty
> significant variable.
Seems there is hope we can have this in 10.0.
>>> 2.) I agree with your criticisms of the MySQL 5.6 replication design. It's
>>> surprisingly complex and looks as if it will have a lot of bugs. I felt at
>>
>> Yes. They need this complexity because their parallel replication (which is
>> similar to Tungsten parallel replication I think?) can arbitrarily reorder
>> transactions in a slave binlog compared to the order in the master binlog.
> We just use the sequential log. The slaves are complex but it does not
> pollute the log. In fact, everything up to the point of applying to the
> slave is purely sequential. I think their design got this wrong.
Well, the MySQL --log-slave-updates work by generating completely new events
on the slave as the received events are executed by the SQL thread(s).
Do you mean that Tungsten does not do this, but uses the original binlog with
original events from the master? How is multi-master (one slave replicating
from several masters) handled (if it is handled) ?
>>> the time that the better approach on transaction IDs would be to put enough
>>> information into logs to enable applications to compute a vector clock on
>>> the stream of binlog events coming from multiple servers. This is a
>>
>> Interesting, I though about a concept of "vector clock" before, but did not
>> know the terminology.
>>
>> Aha, so your point is that even though we will apply changes from multiple
>> upstream master servers in parallel on a single slave, with a vector clock, we
>> can still order them on a slave according to which transactions on one server
>> where visible to which transactions on the other. And stuff like
>> that. Interesting!
>>
>> But note that to be truly "theoretically sound", all queries (ie. SELECT) as
>> well as all applications that interact directly or indirectly with multiple
>> servers need to be included in the vector clock. So it is not without
>> complexity.
> Are you thinking about holding locks? For practical purposes it's enough to
> do writes only. My point is that you have a global ID that is partially
> ordered by originating server. Vector clocks are the traditional way of
> handling that. This is the basis for most mechanisms of conflict
> resolution--you check to see if the clocks are ordered and if they are not
> (because two thing happened at the same time, hence are not comparable), it
> signals a conflict that must be reconciled.
Well, I was mostly just reacting to the phrase "truly theoretically sound". If
readers are not considered in the vector clock, then they are able to observe
different states of the database at different nodes in the cluster. But as you
say, that is probably acceptable.
>> This does not immediately apply to the MySQL design, as they work with a
>> single master (applying in parallel transactions in different databases), and
>> as they arbitrarily reorder between different databases, without any attempt
>> to maintain inter-database consistency. But of course, my point is that this
>> design is fundamentally wrong.
>>
>>> theoretically sound approach to enable both synchronous and asynchronous
>>> multi-master replication. (Galera incidentally uses approximately this
>>> approach for sync replication as you may be aware.) If you give us server
>>> IDs we don't need anything else. Oops, I forgot. Making the binlog truly
>>> crash safe in every possible case is also very helpful.
>>
>> I think it is getting closer to being truly crash safe - can you mention the
>> situations where it is not? And btw., are you interested in the recent testing
>> we (especially Elena) made for testing such crash safety, and the fixes made
>> for it? We found some fairly interesting (and scary) stuff...
> Actually your (MP's?) fix for group commit is the most important
> improvement. Can we turn sync_binlog back on without killing performance?
Yes, well at least performance is substantially improved with sync_binlog=1.
With a parallel load you can see improvements of 10x or more, and even
single-threaded load is improved in 10.0.
>> I have been thinking about this for some time. Basically, the correct approach
>> is to make the binlog into a true transaction log using standard textbook
>> techniques. Pre-allocate the log, make it page based (this will also improve
>> performance). Do ping-pong writes at the end so that we can recover even from
>> torn pages. Stuff like that. Would be cool to do, but so far I have not really
>> seen much interest for something like this.
> It would be a lot of work. If group commit is fixed, then most on-premise
> people can solve enough problems with RAID and BBU that they don't need much
> more.
Right, agree, I thought you had something more subtle than sync_binlog=0 in
mind.
(BTW, I fixed a couple binlog crash recovery bugs recently in MariaDB, which
also affect MySQL).
>>> 3.) I read the email you cited. I need some time to study it in order to
>>> comment intelligently. The most immediate feedback is that I doubt you want
>>> to make assumptions about how slaves will apply transactions. The role of
>>> the master is to provide local transactions in serial order. Looking across
>>> a set of masters you then have a set of easily identifiable streams of
>>> partially ordered transactions. That in turn provides a mathematically
>>> tractable model for reasoning about things like update order, dependencies,
>>> and potential conflicts. Any metadata you can supply about dependencies,
>>> e.g., which schema(s) transactions actually touch is useful but may or may
>>> not be relevant on slaves depending on the application. It's up to slaves
>>> to make decisions on reordering.
>>
>> But think about a three-level multi-master topology. Servers A and B are
>> masters for server C, which is itself a master for D. The DBA/application
>> developer will declare that transactions on A and B are independent, so C can
>> apply them in any order, and in parallel. Clearly on D we will want to also
>> apply them in parallel, and the only way we can do that is to use the
>> different server ids A and B in the stream of events in the binlog of C.
>>
>> So now consider a simple two-level topology with just master C and slave
>> D. Why not still let C put different "server_id"s in different events, as a
>> way to tell server D that events are independent? For example, if the DBA runs
>> a long-running ALTER TABLE, he can ensure that no transactions touch this
>> table while it runs, and just SET SESSION SERVER_ID to something else, and it
>> can run in parallel on D without delaying normal transactions.
>>
>> Or think about MySQL 5.6 style parallel replication, where the DBA declares
>> that on server A, transactions against different databases are to be
>> considered independent and applied in parallel on the slave. While on server B
>> such transactions are not independent. Now we want to setup slave C to
>> replicate from both A and B. Clearly, C is the wrong place to configure
>> whether transactions in different databases are to be considered independent -
>> this is a property of how the transactions are generated, not where they are
>> replicated. But if we just let the server_id of transactions on master A be a
>> hash of the affected database, then MySQL 5.6 style parallel replication just
>> falls out of multi-master for tree, without any of their complexity.
> I have not ever seen this topology arise in practice within MySQL, as people
> tend to keep things pretty homogeneous. It seems to arise mostly in data
> warehouse loading where multiple applications load data into the same store
> for reporting. These are more heterogeneous but arallel information from
> MySQL is of limited value here--data warehouses in Vertica or Hadoop don't
> have constraints and you can just jam data into them in parallel.
Ok.
>> It is just a matter of not artificially restricting transactions generated on
>> a given server to have just one server_id value. Which is in fact exactly what
>> you asked me about originally with SET SESSION SERVER_ID :)
>>
>> On the other hand, the slave is free to try to discover more opportunities for
>> independence / parallelism if it can, and use them. But automatically
>> discovering opportunities for parallelism on the slave is a very hard problem
>> in general, as I am sure you agree ...
> Meanwhile, I don't like the idea of using server_id in the way you describe
> because it overloads the meaning. Server_id just tells us which server
> actually committed the transaction in the first place. The reason we need
> the session variable was to be able to signal this properly. I was not
> intending to use it more broadly.
Ok, I've noted your concern. It would be interesting to learn how you use the
knowledge that one mysqld instance can have at most one server id in generated
transactions?
> More generally, I believe that if you want to allow programmers to declare
> that things are independent it would be best to add SQL-level features. For
> instance, Facebook has a patch that prevents transactions from spanning
> schemas. This is a very useful feature, because it ensures that all updates
> in one schema are independent from others. That in turn means you can
> parallelize without being afraid of deadlocks (a big problem in this space)
> or constraint violations.
>
> You could generalize this into the notion of a shard, which is a scope for
> independent transactions and presumably also referential integrity. This
> would be a very useful feature for SQL databases. I don't know if you have
> read it but Pat Helland has a great paper on this topic called "Life beyond
> Distributed Transactions: an Apostate’s Opinion." It was visionary at the
> time (and still is) and has never been properly implemented in relational
> DBMS to my knowledge.
Well, independent schemas are nice, but not enough. There are many
applications that need the slave to apply transactions in parallel even within
a single table.
- Kristian.
1
0