[Maria-developers] Updated (by Guest): Add a mysqlbinlog option to filter updates to certain tables (40)
by worklog-noreply@askmonty.org 10 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to filter updates to certain tables
CREATION DATE..: Mon, 10 Aug 2009, 13:25
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 40 (http://askmonty.org/worklog/?tid=40)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Mon, 10 Aug 2009, 14:51)=-=-
High Level Description modified.
--- /tmp/wklog.40.old.16985 2009-08-10 14:51:59.000000000 +0300
+++ /tmp/wklog.40.new.16985 2009-08-10 14:51:59.000000000 +0300
@@ -1,3 +1,4 @@
Replication slave can be set to filter updates to certain tables with
---replicate-[wild-]{do,ignore}-table options. This task is about adding similar
-functionality to mysqlbinlog.
+--replicate-[wild-]{do,ignore}-table options.
+
+This task is about adding similar functionality to mysqlbinlog.
-=-=(Guest - Mon, 10 Aug 2009, 14:51)=-=-
High-Level Specification modified.
--- /tmp/wklog.40.old.16949 2009-08-10 14:51:33.000000000 +0300
+++ /tmp/wklog.40.new.16949 2009-08-10 14:51:33.000000000 +0300
@@ -1 +1,73 @@
+1. Context
+----------
+At the moment, the server has these replication slave options:
+
+ --replicate-do-table=db.tbl
+ --replicate-ignore-table=db.tbl
+ --replicate-wild-do-table=pattern.pattern
+ --replicate-wild-ignore-table=pattern.pattern
+
+They affect both RBR and SBR events. SBR events are checked after the
+statement has been parsed, the server iterates over list of used tables and
+checks them againist --replicate instructions.
+
+What is interesting is that this scheme still allows to update the ignored
+table through a VIEW.
+
+2. Table filtering in mysqlbinlog
+---------------------------------
+
+Per-table filtering of RBR events is easy (as it is relatively easy to extract
+the name of the table that the event applies to).
+
+Per-table filtering of SBR events is hard, as generally it is not apparent
+which tables the statement refers to.
+
+This opens possible options:
+
+2.1 Put the parser into mysqlbinlog
+-----------------------------------
+Once we have a full parser in mysqlbinlog, we'll be able to check which tables
+are used by a statement, and will allow to show behaviour identical to those
+that one obtains when using --replicate-* slave options.
+
+(It is not clear how much effort is needed to put the parser into mysqlbinlog.
+Any guesses?)
+
+
+2.2 Use dumb regexp match
+-------------------------
+Use a really dumb approach. A query is considered to be modifying table X if
+it matches an expression
+
+CREATE TABLE $tablename
+DROP $tablename
+UPDATE ...$tablename ... SET // here '...' can't contain the word 'SET'
+DELETE ...$tablename ... WHERE // same as above
+ALTER TABLE $tablename
+.. etc (go get from the grammar) ..
+
+The advantage over doing the same in awk is that mysqlbinlog will also process
+RBR statements, and together with that will provide a working solution for
+those who are careful with their table names not mixing with string constants
+and such.
+
+(TODO: string constants are of particular concern as they come from
+[potentially hostile] users, unlike e.g. table aliases which come from
+[not hostile] developers. Remove also all string constants before attempting
+to do match?)
+
+2.3 Have the master put annotations
+-----------------------------------
+We could add a master option so that it injects into query a mark that tells
+which tables the query will affect, e.g. for the query
+
+ UPDATE t1 LEFT JOIN db3.t2 ON ... WHERE ...
+
+
+the binlog will have
+
+ /* !mysqlbinlog: updates t1,db3.t2 */ UPDATE t1 LEFT JOIN ...
+
+and further processing in mysqlbinlog will be trivial.
DESCRIPTION:
A replication slave can be set to filter updates to certain tables with the
--replicate-[wild-]{do,ignore}-table options.
This task is about adding similar functionality to mysqlbinlog.
HIGH-LEVEL SPECIFICATION:
1. Context
----------
At the moment, the server has these replication slave options:
--replicate-do-table=db.tbl
--replicate-ignore-table=db.tbl
--replicate-wild-do-table=pattern.pattern
--replicate-wild-ignore-table=pattern.pattern
They affect both RBR and SBR events. SBR events are checked after the
statement has been parsed: the server iterates over the list of used tables
and checks them against the --replicate instructions.
Interestingly, this scheme still allows updating an ignored table through a
VIEW.
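For illustration, the wild variants above match database and table names with
SQL LIKE wildcards ('%' for any character sequence, '_' for a single
character). A minimal sketch of that matching logic in Python; the function
name wild_table_match is invented for this example and is not part of the
server:

```python
import re

def wild_table_match(pattern, db, table):
    """Check db.table against a --replicate-wild-*-table pattern.

    The pattern uses SQL LIKE wildcards: '%' matches any character
    sequence, '_' matches exactly one character.
    """
    def like_to_re(p):
        # Escape regex metacharacters, then translate the LIKE wildcards.
        return re.escape(p).replace("%", ".*").replace("_", ".")

    db_pat, _, tbl_pat = pattern.partition(".")
    return (re.fullmatch(like_to_re(db_pat), db) is not None
            and re.fullmatch(like_to_re(tbl_pat), table) is not None)
```

So, for example, --replicate-wild-ignore-table=test.% would match every table
in the test database.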
2. Table filtering in mysqlbinlog
---------------------------------
Per-table filtering of RBR events is easy (as it is relatively easy to extract
the name of the table that the event applies to).
Per-table filtering of SBR events is hard, as generally it is not apparent
which tables the statement refers to.
This leaves the following possible options:
2.1 Put the parser into mysqlbinlog
-----------------------------------
Once we have a full parser in mysqlbinlog, we will be able to check which
tables are used by a statement, and can provide behaviour identical to that
obtained with the --replicate-* slave options.
(It is not clear how much effort is needed to put the parser into mysqlbinlog.
Any guesses?)
2.2 Use dumb regexp match
-------------------------
Use a really dumb approach: a query is considered to modify table X if it
matches one of the expressions
CREATE TABLE $tablename
DROP $tablename
UPDATE ...$tablename ... SET // here '...' can't contain the word 'SET'
DELETE ...$tablename ... WHERE // same as above
ALTER TABLE $tablename
.. etc. (derive the rest from the grammar) ..
The advantage over doing the same in awk is that mysqlbinlog will also
process RBR events, and so this provides a working solution for those who are
careful to keep their table names from colliding with string constants and
the like.
(TODO: string constants are of particular concern, as they come from
[potentially hostile] users, unlike e.g. table aliases, which come from
[non-hostile] developers. Should all string constants also be removed before
attempting the match?)
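For concreteness, the dumb-regexp idea could be sketched as follows in
Python. The patterns are illustrative only (a real implementation would
derive the full set from the grammar), and statement_touches_table is a name
invented for this sketch:

```python
import re

def statement_touches_table(sql, table):
    """Crude check: does sql look like it modifies the given table?

    Matches a handful of statement shapes with regexps.  For UPDATE the
    text between the keyword and the table name must not contain SET;
    for DELETE it must not contain WHERE, mirroring the '...' caveats
    in the pattern list above.
    """
    t = re.escape(table)
    patterns = [
        r"CREATE\s+TABLE\s+" + t + r"\b",
        r"DROP\s+TABLE\s+" + t + r"\b",
        r"ALTER\s+TABLE\s+" + t + r"\b",
        # UPDATE ... table ... SET, where '...' must not contain SET
        r"UPDATE\s+(?:(?!\bSET\b).)*\b" + t + r"\b(?:(?!\bSET\b).)*\bSET\b",
        # DELETE ... table ... WHERE, same idea with WHERE
        r"DELETE\s+(?:(?!\bWHERE\b).)*\b" + t + r"\b(?:(?!\bWHERE\b).)*\bWHERE\b",
    ]
    return any(re.search(p, sql, re.IGNORECASE | re.DOTALL)
               for p in patterns)
```

As the TODO above notes, a string constant that happens to contain something
like 'UPDATE t1 SET x' would still fool these unanchored patterns.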
2.3 Have the master put annotations
-----------------------------------
We could add a master option so that the master injects into each query a
mark telling which tables the query will affect; e.g. for the query
UPDATE t1 LEFT JOIN db3.t2 ON ... WHERE ...
the binlog will have
/* !mysqlbinlog: updates t1,db3.t2 */ UPDATE t1 LEFT JOIN ...
and further processing in mysqlbinlog will be trivial.
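Under that scheme the mysqlbinlog side could indeed be trivial. A sketch in
Python, assuming the annotation format shown above; parse_update_annotation
and event_passes_filter are invented names, not existing mysqlbinlog code:

```python
import re

# Matches the hypothetical annotation from the example above, e.g.
#   /* !mysqlbinlog: updates t1,db3.t2 */ UPDATE t1 LEFT JOIN ...
ANNOTATION = re.compile(r"/\*\s*!mysqlbinlog:\s*updates\s+([^*]+?)\s*\*/")

def parse_update_annotation(query):
    """Return the tables named in a leading annotation, or None."""
    m = ANNOTATION.match(query)
    if m is None:
        return None
    return [t.strip() for t in m.group(1).split(",")]

def event_passes_filter(query, do_tables):
    """Keep the event only if it updates at least one do-table."""
    tables = parse_update_annotation(query)
    if tables is None:
        return True  # no annotation: pass the event through unchanged
    return any(t in do_tables for t in tables)
```

Unannotated events (e.g. from an older master) are passed through here
rather than dropped; that conservative default is a design choice, not part
of the proposal.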
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Guest): Add a mysqlbinlog option to filter updates to certain tables (40)
by worklog-noreply@askmonty.org 10 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to filter updates to certain tables
CREATION DATE..: Mon, 10 Aug 2009, 13:25
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 40 (http://askmonty.org/worklog/?tid=40)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Mon, 10 Aug 2009, 14:51)=-=-
High-Level Specification modified.
--- /tmp/wklog.40.old.16949 2009-08-10 14:51:33.000000000 +0300
+++ /tmp/wklog.40.new.16949 2009-08-10 14:51:33.000000000 +0300
@@ -1 +1,73 @@
+1. Context
+----------
+At the moment, the server has these replication slave options:
+
+ --replicate-do-table=db.tbl
+ --replicate-ignore-table=db.tbl
+ --replicate-wild-do-table=pattern.pattern
+ --replicate-wild-ignore-table=pattern.pattern
+
+They affect both RBR and SBR events. SBR events are checked after the
+statement has been parsed, the server iterates over list of used tables and
+checks them againist --replicate instructions.
+
+What is interesting is that this scheme still allows to update the ignored
+table through a VIEW.
+
+2. Table filtering in mysqlbinlog
+---------------------------------
+
+Per-table filtering of RBR events is easy (as it is relatively easy to extract
+the name of the table that the event applies to).
+
+Per-table filtering of SBR events is hard, as generally it is not apparent
+which tables the statement refers to.
+
+This opens possible options:
+
+2.1 Put the parser into mysqlbinlog
+-----------------------------------
+Once we have a full parser in mysqlbinlog, we'll be able to check which tables
+are used by a statement, and will allow to show behaviour identical to those
+that one obtains when using --replicate-* slave options.
+
+(It is not clear how much effort is needed to put the parser into mysqlbinlog.
+Any guesses?)
+
+
+2.2 Use dumb regexp match
+-------------------------
+Use a really dumb approach. A query is considered to be modifying table X if
+it matches an expression
+
+CREATE TABLE $tablename
+DROP $tablename
+UPDATE ...$tablename ... SET // here '...' can't contain the word 'SET'
+DELETE ...$tablename ... WHERE // same as above
+ALTER TABLE $tablename
+.. etc (go get from the grammar) ..
+
+The advantage over doing the same in awk is that mysqlbinlog will also process
+RBR statements, and together with that will provide a working solution for
+those who are careful with their table names not mixing with string constants
+and such.
+
+(TODO: string constants are of particular concern as they come from
+[potentially hostile] users, unlike e.g. table aliases which come from
+[not hostile] developers. Remove also all string constants before attempting
+to do match?)
+
+2.3 Have the master put annotations
+-----------------------------------
+We could add a master option so that it injects into query a mark that tells
+which tables the query will affect, e.g. for the query
+
+ UPDATE t1 LEFT JOIN db3.t2 ON ... WHERE ...
+
+
+the binlog will have
+
+ /* !mysqlbinlog: updates t1,db3.t2 */ UPDATE t1 LEFT JOIN ...
+
+and further processing in mysqlbinlog will be trivial.
DESCRIPTION:
A replication slave can be set to filter updates to certain tables with the
--replicate-[wild-]{do,ignore}-table options. This task is about adding
similar functionality to mysqlbinlog.
HIGH-LEVEL SPECIFICATION:
1. Context
----------
At the moment, the server has these replication slave options:
--replicate-do-table=db.tbl
--replicate-ignore-table=db.tbl
--replicate-wild-do-table=pattern.pattern
--replicate-wild-ignore-table=pattern.pattern
They affect both RBR and SBR events. SBR events are checked after the
statement has been parsed: the server iterates over the list of used tables
and checks them against the --replicate instructions.
Interestingly, this scheme still allows updating an ignored table through a
VIEW.
2. Table filtering in mysqlbinlog
---------------------------------
Per-table filtering of RBR events is easy (as it is relatively easy to extract
the name of the table that the event applies to).
Per-table filtering of SBR events is hard, as generally it is not apparent
which tables the statement refers to.
This leaves the following possible options:
2.1 Put the parser into mysqlbinlog
-----------------------------------
Once we have a full parser in mysqlbinlog, we will be able to check which
tables are used by a statement, and can provide behaviour identical to that
obtained with the --replicate-* slave options.
(It is not clear how much effort is needed to put the parser into mysqlbinlog.
Any guesses?)
2.2 Use dumb regexp match
-------------------------
Use a really dumb approach: a query is considered to modify table X if it
matches one of the expressions
CREATE TABLE $tablename
DROP $tablename
UPDATE ...$tablename ... SET // here '...' can't contain the word 'SET'
DELETE ...$tablename ... WHERE // same as above
ALTER TABLE $tablename
.. etc. (derive the rest from the grammar) ..
The advantage over doing the same in awk is that mysqlbinlog will also
process RBR events, and so this provides a working solution for those who are
careful to keep their table names from colliding with string constants and
the like.
(TODO: string constants are of particular concern, as they come from
[potentially hostile] users, unlike e.g. table aliases, which come from
[non-hostile] developers. Should all string constants also be removed before
attempting the match?)
2.3 Have the master put annotations
-----------------------------------
We could add a master option so that the master injects into each query a
mark telling which tables the query will affect; e.g. for the query
UPDATE t1 LEFT JOIN db3.t2 ON ... WHERE ...
the binlog will have
/* !mysqlbinlog: updates t1,db3.t2 */ UPDATE t1 LEFT JOIN ...
and further processing in mysqlbinlog will be trivial.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Knielsen): Using the Valgrind API in mysqld (23)
by worklog-noreply@askmonty.org 10 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Using the Valgrind API in mysqld
CREATION DATE..: Fri, 22 May 2009, 11:43
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 23 (http://askmonty.org/worklog/?tid=23)
VERSION........: Server-5.1
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 40 (hours remain)
ORIG. ESTIMATE.: 40
PROGRESS NOTES:
-=-=(Knielsen - Mon, 10 Aug 2009, 14:27)=-=-
Low Level Design modified.
--- /tmp/wklog.23.old.16018 2009-08-10 14:27:09.000000000 +0300
+++ /tmp/wklog.23.new.16018 2009-08-10 14:27:09.000000000 +0300
@@ -5,3 +5,5 @@
- sql/item_strfunc.cc (Item_func_compress).
+Another good place is in the TRASH_MEM macro.
+
-=-=(Knielsen - Wed, 24 Jun 2009, 15:55)=-=-
Supervisor updated.
--- /tmp/wklog.23.old.944 2009-06-24 15:55:57.000000000 +0300
+++ /tmp/wklog.23.new.944 2009-06-24 15:55:57.000000000 +0300
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Knielsen - Wed, 24 Jun 2009, 15:53)=-=-
Version updated.
--- /tmp/wklog.23.old.911 2009-06-24 15:53:32.000000000 +0300
+++ /tmp/wklog.23.new.911 2009-06-24 15:53:32.000000000 +0300
@@ -1 +1 @@
-Maria-1.0
+Server-5.1
-=-=(Knielsen - Wed, 24 Jun 2009, 15:52)=-=-
Version updated.
--- /tmp/wklog.23.old.897 2009-06-24 15:52:43.000000000 +0300
+++ /tmp/wklog.23.new.897 2009-06-24 15:52:43.000000000 +0300
@@ -1 +1 @@
-Connector/.NET-2.1
+Maria-1.0
-=-=(Knielsen - Wed, 24 Jun 2009, 15:52)=-=-
Version updated.
--- /tmp/wklog.23.old.895 2009-06-24 15:52:28.000000000 +0300
+++ /tmp/wklog.23.new.895 2009-06-24 15:52:28.000000000 +0300
@@ -1 +1 @@
-Maria-1.0
+Connector/.NET-2.1
-=-=(Knielsen - Wed, 24 Jun 2009, 15:35)=-=-
Version updated.
--- /tmp/wklog.23.old.32742 2009-06-24 15:35:48.000000000 +0300
+++ /tmp/wklog.23.new.32742 2009-06-24 15:35:48.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+Maria-1.0
-=-=(Knielsen - Fri, 22 May 2009, 14:31)=-=-
Low Level Design modified.
--- /tmp/wklog.23.old.24587 2009-05-22 14:31:52.000000000 +0300
+++ /tmp/wklog.23.new.24587 2009-05-22 14:31:52.000000000 +0300
@@ -1 +1,7 @@
+Two places where we call into libz, and where checking for defined parameters
+would be good:
+
+ - mysys/my_compress.c
+
+ - sql/item_strfunc.cc (Item_func_compress).
-=-=(Guest - Fri, 22 May 2009, 12:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.23.old.18061 2009-05-22 12:04:05.000000000 +0300
+++ /tmp/wklog.23.new.18061 2009-05-22 12:04:05.000000000 +0300
@@ -26,3 +26,5 @@
initialised, it is possible to detect problems earlier, speeding up debugging.
Such code can be added in more places over time as development and debugging
goes on.
+
+See also a patch here: http://bugs.mysql.com/bug.php?id=44582
-=-=(Knielsen - Fri, 22 May 2009, 11:52)=-=-
High-Level Specification modified.
--- /tmp/wklog.23.old.17628 2009-05-22 11:52:33.000000000 +0300
+++ /tmp/wklog.23.new.17628 2009-05-22 11:52:33.000000000 +0300
@@ -1 +1,28 @@
+With custom memory allocators, using the Valgrind APIs we can tell Valgrind when
+a memory block is allocated (so that data read from memory is marked as undefined
+instead of being defined or not at random depending on prior use); and when a
+memory block is freed (so that use after freeing can be reported as an error).
+In some cases cheking for leaks may also be appropriate.
+
+Another possibility is to add an explicit check for whether memory is defined.
+
+One place this would be useful is when calling libz. Due to the design of that
+library, Valgrind produces lots of false alarms about using undefined values
+(I think the issue is that it runs a few bytes off of initialized memory to
+reduce boundary checks in each loop iteration, then after the loop has checks to
+avoid using the undefined part of the result). This means we have lots of libz
+Valgrind suppressions and continue to add more as new warnings surface. So we
+might easily miss a real problem in this area. This could be improved by adding
+explicit checks at the call to libz functions that the passed memory is properly
+defined.
+
+Another use is to improve debugging. It is often the case when debugging a
+warning about using un-initialised memory that the detection happens long after
+the real problem, the un-initialized value being passed along through the code
+for a long time before being detected. This makes debugging the problem slow.
+
+By adding in strategic places code that asserts that a specific value must be
+initialised, it is possible to detect problems earlier, speeding up debugging.
+Such code can be added in more places over time as development and debugging
+goes on.
DESCRIPTION:
Valgrind (the memcheck tool) has some very useful APIs that can be used in mysqld
when testing with Valgrind to improve testing and/or debugging:
file:///usr/share/doc/valgrind/html/mc-manual.html#mc-manual.clientreqs
file:///usr/share/doc/valgrind/html/mc-manual.html#mc-manual.mempools
This worklog is about adding configure checks and headers so that these APIs
can be used in a way that continues to work on machines where the Valgrind
headers or functionality are missing.
It also includes adding some basic Valgrind enhancements:
- Adding Valgrind annotations to custom memory allocators so that Valgrind can
detect leaks, use-before-init, and use-after-free problems also for these
allocators.
- Adding checks for definedness in appropriate places (e.g. when calling libz).
HIGH-LEVEL SPECIFICATION:
With custom memory allocators, using the Valgrind APIs we can tell Valgrind when
a memory block is allocated (so that data read from memory is marked as undefined
instead of being defined or not at random depending on prior use); and when a
memory block is freed (so that use after freeing can be reported as an error).
In some cases checking for leaks may also be appropriate.
Another possibility is to add an explicit check for whether memory is defined.
One place this would be useful is when calling libz. Due to the design of that
library, Valgrind produces lots of false alarms about using undefined values
(I think the issue is that it reads a few bytes past the end of initialised
memory to reduce boundary checks in each loop iteration, and then checks after
the loop to avoid using the undefined part of the result). This means we have
lots of libz Valgrind suppressions and keep adding more as new warnings
surface, so we might easily miss a real problem in this area. This could be
improved by adding explicit checks, at the calls to libz functions, that the
passed memory is properly defined.
Another use is to improve debugging. When debugging a warning about use of
un-initialised memory, it is often the case that the detection happens long
after the real problem: the un-initialised value is passed along through the
code for a long time before being detected. This makes debugging such
problems slow.
By adding code in strategic places that asserts that a specific value is
initialised, it is possible to detect problems earlier, speeding up debugging.
Such code can be added in more places over time as development and debugging
goes on.
See also a patch here: http://bugs.mysql.com/bug.php?id=44582
LOW-LEVEL DESIGN:
Two places where we call into libz, and where checking for defined parameters
would be good:
- mysys/my_compress.c
- sql/item_strfunc.cc (Item_func_compress).
Another good place is in the TRASH_MEM macro.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] New (by Psergey): Add a mysqlbinlog option to filter updates to certain tables (40)
by worklog-noreply@askmonty.org 10 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to filter updates to certain tables
CREATION DATE..: Mon, 10 Aug 2009, 13:25
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 40 (http://askmonty.org/worklog/?tid=40)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
A replication slave can be configured to filter updates to certain tables with
the --replicate-[wild-]{do,ignore}-table options. This task is about adding
similar functionality to mysqlbinlog.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Guest): Add a mysqlbinlog option to change the used database (36)
by worklog-noreply@askmonty.org 10 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the used database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Mon, 10 Aug 2009, 11:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.36.old.6580 2009-08-10 11:12:36.000000000 +0300
+++ /tmp/wklog.36.new.6580 2009-08-10 11:12:36.000000000 +0300
@@ -1,4 +1,3 @@
-
Context
-------
At the moment, the server has a replication slave option
@@ -67,6 +66,6 @@
It will be possible to do the rewrites either on the slave (
--replicate-rewrite-db will work for all kinds of statements), or in
-mysqlbinlog (adding a comment is easy and doesn't require use to parse the
-statement).
+mysqlbinlog (adding a comment is easy and doesn't require mysqlbinlog to
+parse the statement).
-=-=(Psergey - Sun, 09 Aug 2009, 23:53)=-=-
High-Level Specification modified.
--- /tmp/wklog.36.old.13425 2009-08-09 23:53:54.000000000 +0300
+++ /tmp/wklog.36.new.13425 2009-08-09 23:53:54.000000000 +0300
@@ -1 +1,72 @@
+Context
+-------
+At the moment, the server has a replication slave option
+
+ --replicate-rewrite-db="from->to"
+
+the option affects
+- Table_map_log_event (all RBR events)
+- Load_log_event (LOAD DATA)
+- Query_log_event (SBR-based updates, with the usual assumption that the
+ statement refers to tables in current database, so that changing the current
+ database will make the statement to work on a table in a different database).
+
+What we could do
+----------------
+
+Option1: make mysqlbinlog accept --replicate-rewrite-db option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
+same extent as replication slave would process --replicate-rewrite-db option.
+
+
+Option2: Add database-agnostic RBR events and --strip-db option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Right now RBR events require a databasename. It is not possible to have RBR
+event stream that won't mention which database the events are for. When I
+tried to use debugger and specify empty database name, attempt to apply the
+binlog resulted in this error:
+
+090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
+opening tables,
+
+We could do as follows:
+- Make the server interpret empty database name in RBR event (i.e. in a
+ Table_map_log_event) as "use current database". Binlog slave thread
+ probably should not allow such events as it doesn't have a natural current
+ database.
+- Add a mysqlbinlog --strip-db option that would
+ = not produce any "USE dbname" statements
+ = change databasename for all RBR events to be empty
+
+That way, mysqlbinlog output will be database-agnostic and apply to the
+current database.
+(this will have the usual limitations that we assume that all statements in
+the binlog refer to the current database).
+
+Option3: Enhance database rewrite
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If there is a need to support database change for statements that use
+dbname.tablename notation and are replicated as statements (i.e. are DDL
+statements and/or DML statements that are binlogged as statements),
+then that could be supported as follows:
+
+- Make the server's parser recognize special form of comments
+
+ /* !database-alias(oldname,newname) */
+
+ and save the mapping somewhere
+
+- Put the hooks in table open and name resolution code to use the saved
+ mapping.
+
+
+Once we've done the above, it will be easy to perform a complete,
+no-compromise or restrictions database name change in binary log.
+
+It will be possible to do the rewrites either on the slave (
+--replicate-rewrite-db will work for all kinds of statements), or in
+mysqlbinlog (adding a comment is easy and doesn't require use to parse the
+statement).
+
-=-=(Psergey - Sun, 09 Aug 2009, 12:27)=-=-
Dependency created: 39 now depends on 36
-=-=(Psergey - Fri, 07 Aug 2009, 14:57)=-=-
Title modified.
--- /tmp/wklog.36.old.14687 2009-08-07 14:57:49.000000000 +0300
+++ /tmp/wklog.36.new.14687 2009-08-07 14:57:49.000000000 +0300
@@ -1 +1 @@
-Add a mysqlbinlog option to change the database
+Add a mysqlbinlog option to change the used database
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on the binlog producer.
If one is using statement-based replication, this can be achieved by grepping
the "USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as the database name is
encoded within the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow changing
the names of the used databases in both RBR and SBR events.
(*) This implies that all statements refer to tables in the current database;
it doesn't catch updates made inside stored functions and so forth, but it
still works for a practically important subset of cases.
HIGH-LEVEL SPECIFICATION:
Context
-------
At the moment, the server has a replication slave option
--replicate-rewrite-db="from->to"
the option affects
- Table_map_log_event (all RBR events)
- Load_log_event (LOAD DATA)
- Query_log_event (SBR-based updates, with the usual assumption that the
statement refers to tables in current database, so that changing the current
database will make the statement to work on a table in a different database).
What we could do
----------------
Option1: make mysqlbinlog accept --replicate-rewrite-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
same extent as replication slave would process --replicate-rewrite-db option.
Option2: Add database-agnostic RBR events and --strip-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Right now RBR events require a database name. It is not possible to have an
RBR event stream that doesn't mention which database the events are for. When
I used a debugger to specify an empty database name, the attempt to apply the
binlog resulted in this error:
090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
opening tables,
We could do as follows:
- Make the server interpret empty database name in RBR event (i.e. in a
Table_map_log_event) as "use current database". Binlog slave thread
probably should not allow such events as it doesn't have a natural current
database.
- Add a mysqlbinlog --strip-db option that would
= not produce any "USE dbname" statements
= change databasename for all RBR events to be empty
That way, mysqlbinlog output will be database-agnostic and apply to the
current database.
(this will have the usual limitations that we assume that all statements in
the binlog refer to the current database).
Option3: Enhance database rewrite
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If there is a need to support database change for statements that use
dbname.tablename notation and are replicated as statements (i.e. are DDL
statements and/or DML statements that are binlogged as statements),
then that could be supported as follows:
- Make the server's parser recognize special form of comments
/* !database-alias(oldname,newname) */
and save the mapping somewhere
- Put the hooks in table open and name resolution code to use the saved
mapping.
Once we've done the above, it will be easy to perform a complete database name
change in the binary log, with no compromises or restrictions.
It will be possible to do the rewrites either on the slave (
--replicate-rewrite-db will work for all kinds of statements), or in
mysqlbinlog (adding a comment is easy and doesn't require mysqlbinlog to
parse the statement).
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add a mysqlbinlog option to change the used database (36)
by worklog-noreply@askmonty.org 09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the used database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 09 Aug 2009, 23:53)=-=-
High-Level Specification modified.
--- /tmp/wklog.36.old.13425 2009-08-09 23:53:54.000000000 +0300
+++ /tmp/wklog.36.new.13425 2009-08-09 23:53:54.000000000 +0300
@@ -1 +1,72 @@
+Context
+-------
+At the moment, the server has a replication slave option
+
+ --replicate-rewrite-db="from->to"
+
+the option affects
+- Table_map_log_event (all RBR events)
+- Load_log_event (LOAD DATA)
+- Query_log_event (SBR-based updates, with the usual assumption that the
+ statement refers to tables in current database, so that changing the current
+ database will make the statement to work on a table in a different database).
+
+What we could do
+----------------
+
+Option1: make mysqlbinlog accept --replicate-rewrite-db option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
+same extent as replication slave would process --replicate-rewrite-db option.
+
+
+Option2: Add database-agnostic RBR events and --strip-db option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Right now RBR events require a databasename. It is not possible to have RBR
+event stream that won't mention which database the events are for. When I
+tried to use debugger and specify empty database name, attempt to apply the
+binlog resulted in this error:
+
+090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
+opening tables,
+
+We could do as follows:
+- Make the server interpret empty database name in RBR event (i.e. in a
+ Table_map_log_event) as "use current database". Binlog slave thread
+ probably should not allow such events as it doesn't have a natural current
+ database.
+- Add a mysqlbinlog --strip-db option that would
+ = not produce any "USE dbname" statements
+ = change databasename for all RBR events to be empty
+
+That way, mysqlbinlog output will be database-agnostic and apply to the
+current database.
+(this will have the usual limitations that we assume that all statements in
+the binlog refer to the current database).
+
+Option3: Enhance database rewrite
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If there is a need to support database change for statements that use
+dbname.tablename notation and are replicated as statements (i.e. are DDL
+statements and/or DML statements that are binlogged as statements),
+then that could be supported as follows:
+
+- Make the server's parser recognize special form of comments
+
+ /* !database-alias(oldname,newname) */
+
+ and save the mapping somewhere
+
+- Put the hooks in table open and name resolution code to use the saved
+ mapping.
+
+
+Once we've done the above, it will be easy to perform a complete,
+no-compromise or restrictions database name change in binary log.
+
+It will be possible to do the rewrites either on the slave (
+--replicate-rewrite-db will work for all kinds of statements), or in
+mysqlbinlog (adding a comment is easy and doesn't require use to parse the
+statement).
+
-=-=(Psergey - Sun, 09 Aug 2009, 12:27)=-=-
Dependency created: 39 now depends on 36
-=-=(Psergey - Fri, 07 Aug 2009, 14:57)=-=-
Title modified.
--- /tmp/wklog.36.old.14687 2009-08-07 14:57:49.000000000 +0300
+++ /tmp/wklog.36.new.14687 2009-08-07 14:57:49.000000000 +0300
@@ -1 +1 @@
-Add a mysqlbinlog option to change the database
+Add a mysqlbinlog option to change the used database
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on binlog producer.
If one is using statement-based replication, he can achieve this by grepping
out "USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as database name is encoded
within the the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow to change
the names of used databases in both RBR and SBR events.
(*) this implies that all statements refer to tables in the current database,
doesn't catch updates made inside stored functions and so forth, but still
works for a practially-important subset of cases.
HIGH-LEVEL SPECIFICATION:
Context
-------
At the moment, the server has a replication slave option
--replicate-rewrite-db="from->to"
the option affects
- Table_map_log_event (all RBR events)
- Load_log_event (LOAD DATA)
- Query_log_event (SBR-based updates, with the usual assumption that the
statement refers to tables in current database, so that changing the current
database will make the statement to work on a table in a different database).
What we could do
----------------
Option1: make mysqlbinlog accept --replicate-rewrite-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
same extent as replication slave would process --replicate-rewrite-db option.
Option2: Add database-agnostic RBR events and --strip-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Right now RBR events require a databasename. It is not possible to have RBR
event stream that won't mention which database the events are for. When I
tried to use debugger and specify empty database name, attempt to apply the
binlog resulted in this error:
090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
opening tables,
We could do as follows:
- Make the server interpret empty database name in RBR event (i.e. in a
Table_map_log_event) as "use current database". Binlog slave thread
probably should not allow such events as it doesn't have a natural current
database.
- Add a mysqlbinlog --strip-db option that would
= not produce any "USE dbname" statements
= change databasename for all RBR events to be empty
That way, mysqlbinlog output will be database-agnostic and apply to the
current database.
(this will have the usual limitations that we assume that all statements in
the binlog refer to the current database).
Option3: Enhance database rewrite
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If there is a need to support database change for statements that use
dbname.tablename notation and are replicated as statements (i.e. are DDL
statements and/or DML statements that are binlogged as statements),
then that could be supported as follows:
- Make the server's parser recognize special form of comments
/* !database-alias(oldname,newname) */
and save the mapping somewhere
- Put the hooks in table open and name resolution code to use the saved
mapping.
Once we've done the above, it will be easy to perform a complete,
no-compromise or restrictions database name change in binary log.
It will be possible to do the rewrites either on the slave (
--replicate-rewrite-db will work for all kinds of statements), or in
mysqlbinlog (adding a comment is easy and doesn't require use to parse the
statement).
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Updated (by Psergey): Add a mysqlbinlog option to change the used database (36)
by worklog-noreply@askmonty.org 09 Aug '09
by worklog-noreply@askmonty.org 09 Aug '09
09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the used database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 09 Aug 2009, 23:53)=-=-
High-Level Specification modified.
--- /tmp/wklog.36.old.13425 2009-08-09 23:53:54.000000000 +0300
+++ /tmp/wklog.36.new.13425 2009-08-09 23:53:54.000000000 +0300
@@ -1 +1,72 @@
+Context
+-------
+At the moment, the server has a replication slave option
+
+ --replicate-rewrite-db="from->to"
+
+the option affects
+- Table_map_log_event (all RBR events)
+- Load_log_event (LOAD DATA)
+- Query_log_event (SBR-based updates, with the usual assumption that the
+ statement refers to tables in current database, so that changing the current
+ database will make the statement to work on a table in a different database).
+
+What we could do
+----------------
+
+Option1: make mysqlbinlog accept --replicate-rewrite-db option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
+same extent as replication slave would process --replicate-rewrite-db option.
+
+
+Option2: Add database-agnostic RBR events and --strip-db option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Right now RBR events require a databasename. It is not possible to have RBR
+event stream that won't mention which database the events are for. When I
+tried to use debugger and specify empty database name, attempt to apply the
+binlog resulted in this error:
+
+090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
+opening tables,
+
+We could do as follows:
+- Make the server interpret empty database name in RBR event (i.e. in a
+ Table_map_log_event) as "use current database". Binlog slave thread
+ probably should not allow such events as it doesn't have a natural current
+ database.
+- Add a mysqlbinlog --strip-db option that would
+ = not produce any "USE dbname" statements
+ = change databasename for all RBR events to be empty
+
+That way, mysqlbinlog output will be database-agnostic and apply to the
+current database.
+(this will have the usual limitations that we assume that all statements in
+the binlog refer to the current database).
+
+Option3: Enhance database rewrite
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If there is a need to support database change for statements that use
+dbname.tablename notation and are replicated as statements (i.e. are DDL
+statements and/or DML statements that are binlogged as statements),
+then that could be supported as follows:
+
+- Make the server's parser recognize special form of comments
+
+ /* !database-alias(oldname,newname) */
+
+ and save the mapping somewhere
+
+- Put the hooks in table open and name resolution code to use the saved
+ mapping.
+
+
+Once we've done the above, it will be easy to perform a complete,
+no-compromise or restrictions database name change in binary log.
+
+It will be possible to do the rewrites either on the slave (
+--replicate-rewrite-db will work for all kinds of statements), or in
+mysqlbinlog (adding a comment is easy and doesn't require use to parse the
+statement).
+
-=-=(Psergey - Sun, 09 Aug 2009, 12:27)=-=-
Dependency created: 39 now depends on 36
-=-=(Psergey - Fri, 07 Aug 2009, 14:57)=-=-
Title modified.
--- /tmp/wklog.36.old.14687 2009-08-07 14:57:49.000000000 +0300
+++ /tmp/wklog.36.new.14687 2009-08-07 14:57:49.000000000 +0300
@@ -1 +1 @@
-Add a mysqlbinlog option to change the database
+Add a mysqlbinlog option to change the used database
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on binlog producer.
If one is using statement-based replication, he can achieve this by grepping
out "USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as database name is encoded
within the the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow to change
the names of used databases in both RBR and SBR events.
(*) this implies that all statements refer to tables in the current database,
doesn't catch updates made inside stored functions and so forth, but still
works for a practially-important subset of cases.
HIGH-LEVEL SPECIFICATION:
Context
-------
At the moment, the server has a replication slave option
--replicate-rewrite-db="from->to"
the option affects
- Table_map_log_event (all RBR events)
- Load_log_event (LOAD DATA)
- Query_log_event (SBR-based updates, with the usual assumption that the
statement refers to tables in current database, so that changing the current
database will make the statement to work on a table in a different database).
What we could do
----------------
Option1: make mysqlbinlog accept --replicate-rewrite-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
same extent as replication slave would process --replicate-rewrite-db option.
Option2: Add database-agnostic RBR events and --strip-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Right now RBR events require a databasename. It is not possible to have RBR
event stream that won't mention which database the events are for. When I
tried to use debugger and specify empty database name, attempt to apply the
binlog resulted in this error:
090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
opening tables,
We could do as follows:
- Make the server interpret empty database name in RBR event (i.e. in a
Table_map_log_event) as "use current database". Binlog slave thread
probably should not allow such events as it doesn't have a natural current
database.
- Add a mysqlbinlog --strip-db option that would
= not produce any "USE dbname" statements
= change databasename for all RBR events to be empty
That way, mysqlbinlog output will be database-agnostic and apply to the
current database.
(this will have the usual limitations that we assume that all statements in
the binlog refer to the current database).
Option3: Enhance database rewrite
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If there is a need to support database change for statements that use
dbname.tablename notation and are replicated as statements (i.e. are DDL
statements and/or DML statements that are binlogged as statements),
then that could be supported as follows:
- Make the server's parser recognize special form of comments
/* !database-alias(oldname,newname) */
and save the mapping somewhere
- Put the hooks in table open and name resolution code to use the saved
mapping.
Once we've done the above, it will be easy to perform a complete,
no-compromise or restrictions database name change in binary log.
It will be possible to do the rewrites either on the slave (
--replicate-rewrite-db will work for all kinds of statements), or in
mysqlbinlog (adding a comment is easy and doesn't require use to parse the
statement).
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Updated (by Psergey): Add an option to mysqlbinlog to produce SQL script with fewer roundtrips (37)
by worklog-noreply@askmonty.org 09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add an option to mysqlbinlog to produce SQL script with fewer
roundtrips
CREATION DATE..: Fri, 07 Aug 2009, 17:14
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 37 (http://askmonty.org/worklog/?tid=37)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 09 Aug 2009, 12:56)=-=-
High-Level Specification modified.
--- /tmp/wklog.37.old.22083 2009-08-09 12:56:36.000000000 +0300
+++ /tmp/wklog.37.new.22083 2009-08-09 12:56:36.000000000 +0300
@@ -11,3 +11,16 @@
if (my_b_tell(&cache) != 0)
my_b_write(&cache,";;",2);
+Note: mysqlbinlog already uses
+
+ DELIMITER /*!*/;
+
+so that it can process "multi-statements" like
+
+ CREATE PROCEDURE ... BEGIN stmt1; stmt2; ... END
+
+what remains to be done is to print the /*!*/; only when we're about to exceed
+$args[combine-statements] bytes. In all other cases, delimit statements with
+regular semicolon.
+
+
-=-=(Psergey - Sun, 09 Aug 2009, 12:30)=-=-
High Level Description modified.
--- /tmp/wklog.37.old.21090 2009-08-09 12:30:26.000000000 +0300
+++ /tmp/wklog.37.new.21090 2009-08-09 12:30:26.000000000 +0300
@@ -1,6 +1,6 @@
SQL scripts generated by mysqlbinlog can be slow to load because they have many
small queries, hence applying the script against a remote server requires a lot
-of roundtrips, and they become a bottleneck.
+of roundtrips, and the network roundtrips become the bottleneck.
This bottleneck can be addressed by having mysqlbinlog combine multiple
statements into one:
@@ -14,7 +14,7 @@
loading such sql script will require fewer roundtrips.
-The behavior can be controlled using a command line option
+The behaviour can be controlled using a command line option
mysqlbinlog --combine-statements=#
-=-=(Psergey - Sun, 09 Aug 2009, 12:24)=-=-
Dependency created: 39 now depends on 37
-=-=(Psergey - Fri, 07 Aug 2009, 17:16)=-=-
High-Level Specification modified.
--- /tmp/wklog.37.old.20454 2009-08-07 17:16:54.000000000 +0300
+++ /tmp/wklog.37.new.20454 2009-08-07 17:16:54.000000000 +0300
@@ -1 +1,13 @@
+Implementation overview:
+
+- At start, print "--delimiter=;;"
+- Modify the start of each print functions as follows
+
+ if (my_b_tell(&cache) - my_start_of_combine_statement) +
+ estimiated_size_of_log_event) > combine_statement_size)
+ my_b_write(&cache,";;",2);
+
+- And we should end mysqlbinlog with;
+ if (my_b_tell(&cache) != 0)
+ my_b_write(&cache,";;",2);
DESCRIPTION:
SQL scripts generated by mysqlbinlog can be slow to load because they contain
many small queries; applying such a script against a remote server therefore
requires many roundtrips, and the network roundtrips become the bottleneck.
This bottleneck can be addressed by having mysqlbinlog combine multiple
statements into one:
+delimiter //
binlog statement1;
binlog statement2;
binlog statement3;
+//
binlog statement4;
Loading such an SQL script will require fewer roundtrips.
The behaviour can be controlled using a command-line option
mysqlbinlog --combine-statements=#
where # is the maximum allowed packet length.
HIGH-LEVEL SPECIFICATION:
Implementation overview:
- At start, print "--delimiter=;;"
- Modify the start of each print function as follows:

  if ((my_b_tell(&cache) - my_start_of_combine_statement +
       estimated_size_of_log_event) > combine_statement_size)
    my_b_write(&cache, ";;", 2);

- And we should end mysqlbinlog with:

  if (my_b_tell(&cache) != 0)
    my_b_write(&cache, ";;", 2);

Note: mysqlbinlog already uses

  DELIMITER /*!*/;

so that it can process "multi-statements" like

  CREATE PROCEDURE ... BEGIN stmt1; stmt2; ... END

What remains to be done is to print the /*!*/; delimiter only when we're
about to exceed $args[combine-statements] bytes; in all other cases, delimit
statements with a regular semicolon.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)



[Maria-developers] New (by Psergey): Replication tasks (39)
by worklog-noreply@askmonty.org 09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication tasks
CREATION DATE..: Sun, 09 Aug 2009, 12:24
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-RawIdeaBin
TASK ID........: 39 (http://askmonty.org/worklog/?tid=39)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
An umbrella task collecting all replication tasks.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)


[Maria-developers] Updated (by Guest): Make mysqlbinlog not to output unneeded statements when using --database (38)
by worklog-noreply@askmonty.org 09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Make mysqlbinlog not to output unneeded statements when using
--database
CREATION DATE..: Sat, 08 Aug 2009, 12:40
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 38 (http://askmonty.org/worklog/?tid=38)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Sun, 09 Aug 2009, 12:22)=-=-
High-Level Specification modified.
--- /tmp/wklog.38.old.20756 2009-08-09 12:22:52.000000000 +0300
+++ /tmp/wklog.38.new.20756 2009-08-09 12:22:52.000000000 +0300
@@ -1 +1,18 @@
+Monty's suggestions for fix:
+
+A way to fix this for 'most' cases are:
+
+If we do filtering (like mysqlbinlog --database='xxx') then:
+
+- In mysql_bin_log(), do a flush of the Log_event::cache() between
+ each statement.
+- Log on statement.
+- If the statement was ignored (we need a flag for this) and
+ there is something in the cache and the file position didn't change
+ (the cache didn't overflow), then reset the cache.
+
+Bug #23890 mysqlbinlog outputs COMMIT unnecessarily when single
+database is used
+- Could be fixed by having a flag to mark if something was printed
+ to the log since last commit.
-=-=(Guest - Sun, 09 Aug 2009, 12:20)=-=-
High Level Description modified.
--- /tmp/wklog.38.old.20618 2009-08-09 12:20:16.000000000 +0300
+++ /tmp/wklog.38.new.20618 2009-08-09 12:20:16.000000000 +0300
@@ -1,10 +1,24 @@
-This comes from MySQL BUG#23890:
+This comes from MySQL BUG#23890 and BUG#23894: when one runs
- mysqlbinlog --database=bar N-bin.000003
+ mysqlbinlog --database=bar binlog_file
-will output all the COMMIT statements in the binary log even if it didn't print
-any statements between the COMMITs (because all the statements that were there
-were for the other databases).
+the produced SQL may contain useless sequences of commands like:
-The fix is trivial: in mysqlbinlog, check if we've printed anything after we've
-printed the previous commit statement.
+COMMIT;
+COMMIT;
+COMMIT;
+...
+
+or
+
+SET INSERT_ID=val1;
+SET INSERT_ID=val2;
+SET INSERT_ID=val3;
+...
+
+This happens because the statements between COMMIT (or SET) statements had no
+effect on the specified database and so were filtered out. COMMIT and SET
+statements themselves are not associated with any database and were left in.
+
+Presence of redundant COMMIT or SET statements makes binlog SQL script
+unnecessarily big and it will take more client<->server roundtrips to apply it.
-=-=(Guest - Sun, 09 Aug 2009, 12:19)=-=-
Title modified.
--- /tmp/wklog.38.old.20544 2009-08-09 12:19:22.000000000 +0300
+++ /tmp/wklog.38.new.20544 2009-08-09 12:19:22.000000000 +0300
@@ -1 +1 @@
-Make mysqlbinlog not to output unneeded COMMIT statements
+Make mysqlbinlog not to output unneeded statements when using --database
DESCRIPTION:
This comes from MySQL BUG#23890 and BUG#23894: when one runs
mysqlbinlog --database=bar binlog_file
the produced SQL may contain useless sequences of commands like:
COMMIT;
COMMIT;
COMMIT;
...
or
SET INSERT_ID=val1;
SET INSERT_ID=val2;
SET INSERT_ID=val3;
...
This happens because the statements between COMMIT (or SET) statements had no
effect on the specified database and so were filtered out. COMMIT and SET
statements themselves are not associated with any database and were left in.
Presence of redundant COMMIT or SET statements makes the binlog SQL script
unnecessarily big, and applying it takes more client<->server roundtrips.
HIGH-LEVEL SPECIFICATION:
Monty's suggestions for fix:
A way to fix this for 'most' cases is:
If we do filtering (like mysqlbinlog --database='xxx') then:
- In mysql_bin_log(), do a flush of the Log_event::cache() between
each statement.
- Log on statement.
- If the statement was ignored (we need a flag for this) and
there is something in the cache and the file position didn't change
(the cache didn't overflow), then reset the cache.
Bug #23890 mysqlbinlog outputs COMMIT unnecessarily when single
database is used
- Could be fixed by having a flag to mark if something was printed
to the log since last commit.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Guest): Make mysqlbinlog not to output unneeded statements when using --database (38)
by worklog-noreply@askmonty.org 09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Make mysqlbinlog not to output unneeded statements when using
--database
CREATION DATE..: Sat, 08 Aug 2009, 12:40
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 38 (http://askmonty.org/worklog/?tid=38)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Sun, 09 Aug 2009, 12:20)=-=-
High Level Description modified.
--- /tmp/wklog.38.old.20618 2009-08-09 12:20:16.000000000 +0300
+++ /tmp/wklog.38.new.20618 2009-08-09 12:20:16.000000000 +0300
@@ -1,10 +1,24 @@
-This comes from MySQL BUG#23890:
+This comes from MySQL BUG#23890 and BUG#23894: when one runs
- mysqlbinlog --database=bar N-bin.000003
+ mysqlbinlog --database=bar binlog_file
-will output all the COMMIT statements in the binary log even if it didn't print
-any statements between the COMMITs (because all the statements that were there
-were for the other databases).
+the produced SQL may contain useless sequences of commands like:
-The fix is trivial: in mysqlbinlog, check if we've printed anything after we've
-printed the previous commit statement.
+COMMIT;
+COMMIT;
+COMMIT;
+...
+
+or
+
+SET INSERT_ID=val1;
+SET INSERT_ID=val2;
+SET INSERT_ID=val3;
+...
+
+This happens because the statements between COMMIT (or SET) statements had no
+effect on the specified database and so were filtered out. COMMIT and SET
+statements themselves are not associated with any database and were left in.
+
+Presence of redundant COMMIT or SET statements makes binlog SQL script
+unnecessarily big and it will take more client<->server roundtrips to apply it.
-=-=(Guest - Sun, 09 Aug 2009, 12:19)=-=-
Title modified.
--- /tmp/wklog.38.old.20544 2009-08-09 12:19:22.000000000 +0300
+++ /tmp/wklog.38.new.20544 2009-08-09 12:19:22.000000000 +0300
@@ -1 +1 @@
-Make mysqlbinlog not to output unneeded COMMIT statements
+Make mysqlbinlog not to output unneeded statements when using --database
DESCRIPTION:
This comes from MySQL BUG#23890 and BUG#23894: when one runs
mysqlbinlog --database=bar binlog_file
the produced SQL may contain useless sequences of commands like:
COMMIT;
COMMIT;
COMMIT;
...
or
SET INSERT_ID=val1;
SET INSERT_ID=val2;
SET INSERT_ID=val3;
...
This happens because the statements between COMMIT (or SET) statements had no
effect on the specified database and so were filtered out. COMMIT and SET
statements themselves are not associated with any database and were left in.
Presence of redundant COMMIT or SET statements makes the binlog SQL script
unnecessarily big, and applying it takes more client<->server roundtrips.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Guest): Make mysqlbinlog not to output unneeded statements when using --database (38)
by worklog-noreply@askmonty.org 09 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Make mysqlbinlog not to output unneeded statements when using
--database
CREATION DATE..: Sat, 08 Aug 2009, 12:40
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 38 (http://askmonty.org/worklog/?tid=38)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Sun, 09 Aug 2009, 12:19)=-=-
Title modified.
--- /tmp/wklog.38.old.20544 2009-08-09 12:19:22.000000000 +0300
+++ /tmp/wklog.38.new.20544 2009-08-09 12:19:22.000000000 +0300
@@ -1 +1 @@
-Make mysqlbinlog not to output unneeded COMMIT statements
+Make mysqlbinlog not to output unneeded statements when using --database
DESCRIPTION:
This comes from MySQL BUG#23890:
mysqlbinlog --database=bar N-bin.000003
will output all the COMMIT statements in the binary log even if it didn't print
any statements between the COMMITs (because all the statements that were there
were for the other databases).
The fix is trivial: in mysqlbinlog, check whether anything has been printed
since the previous COMMIT statement.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] New (by Psergey): Make mysqlbinlog not to output unneeded COMMIT statements (38)
by worklog-noreply@askmonty.org 08 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Make mysqlbinlog not to output unneeded COMMIT statements
CREATION DATE..: Sat, 08 Aug 2009, 12:40
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 38 (http://askmonty.org/worklog/?tid=38)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
This comes from MySQL BUG#23890:
mysqlbinlog --database=bar N-bin.000003
will output all the COMMIT statements in the binary log even if it didn't print
any statements between the COMMITs (because all the statements that were there
were for the other databases).
The fix is trivial: in mysqlbinlog, check whether anything has been printed
since the previous COMMIT statement.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add an option to mysqlbinlog to produce SQL script with fewer roundtrips (37)
by worklog-noreply@askmonty.org 07 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add an option to mysqlbinlog to produce SQL script with fewer
roundtrips
CREATION DATE..: Fri, 07 Aug 2009, 17:14
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 37 (http://askmonty.org/worklog/?tid=37)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Fri, 07 Aug 2009, 17:16)=-=-
High-Level Specification modified.
--- /tmp/wklog.37.old.20454 2009-08-07 17:16:54.000000000 +0300
+++ /tmp/wklog.37.new.20454 2009-08-07 17:16:54.000000000 +0300
@@ -1 +1,13 @@
+Implementation overview:
+
+- At start, print "--delimiter=;;"
+- Modify the start of each print functions as follows
+
+ if (my_b_tell(&cache) - my_start_of_combine_statement) +
+ estimiated_size_of_log_event) > combine_statement_size)
+ my_b_write(&cache,";;",2);
+
+- And we should end mysqlbinlog with;
+ if (my_b_tell(&cache) != 0)
+ my_b_write(&cache,";;",2);
DESCRIPTION:
SQL scripts generated by mysqlbinlog can be slow to load because they have many
small queries, hence applying the script against a remote server requires a lot
of roundtrips, and these roundtrips become the bottleneck.
This bottleneck can be addressed by having mysqlbinlog combine multiple
statements into one:
+delimiter //
binlog statement1;
binlog statement2;
binlog statement3;
+//
binlog statement4;
Loading such an SQL script will require fewer roundtrips.
The behavior can be controlled using a command-line option:
mysqlbinlog --combine-statements=#
where # is the maximum allowed packet length.
HIGH-LEVEL SPECIFICATION:
Implementation overview:
- At start, print "--delimiter=;;"
- Modify the start of each print function as follows:
  if ((my_b_tell(&cache) - my_start_of_combine_statement +
       estimated_size_of_log_event) > combine_statement_size)
    my_b_write(&cache, ";;", 2);
- And we should end mysqlbinlog with:
  if (my_b_tell(&cache) != 0)
    my_b_write(&cache, ";;", 2);
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] New (by Psergey): Add an option to mysqlbinlog to produce SQL script with fewer roundtrips (37)
by worklog-noreply@askmonty.org 07 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add an option to mysqlbinlog to produce SQL script with fewer
roundtrips
CREATION DATE..: Fri, 07 Aug 2009, 17:14
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 37 (http://askmonty.org/worklog/?tid=37)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
SQL scripts generated by mysqlbinlog can be slow to load because they have many
small queries, hence applying the script against a remote server requires a lot
of roundtrips, and these roundtrips become the bottleneck.
This bottleneck can be addressed by having mysqlbinlog combine multiple
statements into one:
+delimiter //
binlog statement1;
binlog statement2;
binlog statement3;
+//
binlog statement4;
Loading such an SQL script will require fewer roundtrips.
The behavior can be controlled using a command-line option:
mysqlbinlog --combine-statements=#
where # is the maximum allowed packet length.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add a mysqlbinlog option to change the used database (36)
by worklog-noreply@askmonty.org 07 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the used database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Fri, 07 Aug 2009, 14:57)=-=-
Title modified.
--- /tmp/wklog.36.old.14687 2009-08-07 14:57:49.000000000 +0300
+++ /tmp/wklog.36.new.14687 2009-08-07 14:57:49.000000000 +0300
@@ -1 +1 @@
-Add a mysqlbinlog option to change the database
+Add a mysqlbinlog option to change the used database
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on the binlog producer.
With statement-based replication, one can achieve this by grepping the
"USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as the database name is
encoded within the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow changing
the names of the used databases in both RBR and SBR events.
(*) This assumes that all statements refer to tables in the current database;
it does not catch updates made inside stored functions and so forth, but it
still works for a practically important subset of cases.
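The statement-based workaround described in (*) can be sketched as a simple
text rewrite over mysqlbinlog's output; the function and database names below
are hypothetical, and the sketch deliberately shares the limitation the task
describes, since row-based BINLOG '...' payloads pass through untouched:

```python
import re

def rewrite_use_db(sql_text, old_db, new_db):
    """Rewrite the `use` lines that mysqlbinlog prints for
    statement-based events, leaving everything else (including
    row-based BINLOG '...' events) unchanged."""
    pattern = re.compile(r"^use `%s`" % re.escape(old_db), re.M)
    return pattern.sub("use `%s`" % new_db, sql_text)
```

In practice one would pipe mysqlbinlog output through such a filter before
feeding it to the target server.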
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] New (by Psergey): Add a mysqlbinlog option to change the database (36)
by worklog-noreply@askmonty.org 07 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on the binlog producer.
With statement-based replication, one can achieve this by grepping the
"USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as the database name is
encoded within the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow changing
the names of the used databases in both RBR and SBR events.
(*) This assumes that all statements refer to tables in the current database;
it does not catch updates made inside stored functions and so forth, but it
still works for a practically important subset of cases.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Progress (by Hingo): Test task for using worklog time track features (35)
by worklog-noreply@askmonty.org 07 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Test task for using worklog time track features
CREATION DATE..: Fri, 07 Aug 2009, 09:28
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Other
TASK ID........: 35 (http://askmonty.org/worklog/?tid=35)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 9 (hours remain)
ORIG. ESTIMATE.: 10
PROGRESS NOTES:
-=-=(Hingo - Fri, 07 Aug 2009, 09:30)=-=-
Adding first hour worked
Worked 1 hour and estimate 9 hours remain (original estimate unchanged).
DESCRIPTION:
Test task for testing time tracking features.
Marking as private. What does that mean?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] New (by Hingo): Test task for using worklog time track features (35)
by worklog-noreply@askmonty.org 07 Aug '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Test task for using worklog time track features
CREATION DATE..: Fri, 07 Aug 2009, 09:28
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Other
TASK ID........: 35 (http://askmonty.org/worklog/?tid=35)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 10 (hours remain)
ORIG. ESTIMATE.: 10
PROGRESS NOTES:
DESCRIPTION:
Test task for testing time tracking features.
Marking as private. What does that mean?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

Vadim Tkachenko <vadim(a)percona.com> writes:
> Kristian,
>
> Now lp:percona-xtradb/release-6 and lp:percona-xtradb
> should be synchronized and be up-to-date.
I still do not see that this is the case :-(
- Last push to lp:percona-xtradb is from June 26, and does not appear
up-to-date with latest XtraDB6 release.
- lp:percona-xtradb/release-6 does not exist. Maybe a typo for
lp:~percona-dev/percona-xtradb/release-6 ?
- I diffed lp:~percona-dev/percona-xtradb/release-6 against the XtraDB source
tarball on http://www.percona.com/mysql/xtradb/5.1.36-6/source/. They do
not match. The bzr branch is missing two bug fixes, on the other hand the
tarball has #define PERCONA_INNODB_VERSION 5a which seems to be wrong (diff
appended at the end of the mail).
Ok. To proceed, I have now committed the two missing bugfixes and pushed here:
lp:~maria-captains/percona-xtradb/release-6-fixed
I will use this as the basis for my merge. Maybe you can pull that into
lp:~percona-dev/percona-xtradb/release-6, or I will handle the conflict if
necessary in the next merge (should be no problem).
> Please let me know if there something else from our side.
Only that I feel a bit stupid having these difficulties finding out what tree
to merge. Seems to me there is something I do not understand correctly... is
there some mailing list or IRC channel or something I should follow to better
keep track of XtraDB development?
- Kristian.
-----------------------------------------------------------------------
$ diff -u --recursive . ../mysql-5.1.36-xtradb6/storage/innobase
Only in .: .bzr
diff -u --recursive ./dict/dict0dict.c ../mysql-5.1.36-xtradb6/storage/innobase/dict/dict0dict.c
--- ./dict/dict0dict.c 2009-08-03 08:30:02.000000000 +0200
+++ ../mysql-5.1.36-xtradb6/storage/innobase/dict/dict0dict.c 2009-07-22 20:07:56.000000000 +0200
@@ -3049,7 +3049,7 @@
} else if (quote) {
/* Within quotes: do not look for
starting quotes or comments. */
- } else if (*sptr == '"' || *sptr == '`') {
+ } else if (*sptr == '"' || *sptr == '`' || *sptr == '\'') {
/* Starting quote: remember the quote character. */
quote = *sptr;
} else if (*sptr == '#'
diff -u --recursive ./handler/ha_innodb.cc ../mysql-5.1.36-xtradb6/storage/innobase/handler/ha_innodb.cc
--- ./handler/ha_innodb.cc 2009-08-03 08:30:02.000000000 +0200
+++ ../mysql-5.1.36-xtradb6/storage/innobase/handler/ha_innodb.cc 2009-07-22 20:07:56.000000000 +0200
@@ -9319,7 +9319,8 @@
/* Check that row format didn't change */
if ((info->used_fields & HA_CREATE_USED_ROW_FORMAT) &&
- get_row_type() != info->row_type) {
+ get_row_type() != ((info->row_type == ROW_TYPE_DEFAULT)
+ ? ROW_TYPE_COMPACT : info->row_type)) {
return(COMPATIBLE_DATA_NO);
}
Only in ../mysql-5.1.36-xtradb6/storage/innobase/handler: ha_innodb.cc.orig
Only in ../mysql-5.1.36-xtradb6/storage/innobase/handler: innodb_patch_info.h.orig
Only in ../mysql-5.1.36-xtradb6/storage/innobase/handler: i_s.cc.orig
diff -u --recursive ./include/univ.i ../mysql-5.1.36-xtradb6/storage/innobase/include/univ.i
--- ./include/univ.i 2009-08-03 08:30:02.000000000 +0200
+++ ../mysql-5.1.36-xtradb6/storage/innobase/include/univ.i 2009-07-22 20:07:56.000000000 +0200
@@ -35,7 +35,7 @@
#define INNODB_VERSION_MAJOR 1
#define INNODB_VERSION_MINOR 0
#define INNODB_VERSION_BUGFIX 3
-#define PERCONA_INNODB_VERSION 6a
+#define PERCONA_INNODB_VERSION 5a
/* The following is the InnoDB version as shown in
SELECT plugin_version FROM information_schema.plugins;
Only in ./mysql-test/patches: mysqlbinlog_row_big.diff
Only in ./mysql-test/patches: variables-big.diff

[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2719: Add a new variant of dlclose() Valgrind suppressions to fix a Buildbot issue.
by noreply@launchpad.net 05 Aug '09
------------------------------------------------------------
revno: 2719
committer: knielsen(a)knielsen-hq.org
branch nick: mariadb-5.1
timestamp: Wed 2009-08-05 09:21:37 +0200
message:
Add a new variant of dlclose() Valgrind suppressions to fix a Buildbot issue.
modified:
mysql-test/valgrind.supp
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1
Your team Maria developers is subscribed to branch lp:maria.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1/+edit-subscription.

[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (knielsen:2719)
by knielsen@knielsen-hq.org 05 Aug '09
#At lp:maria
2719 knielsen(a)knielsen-hq.org 2009-08-05
Add a new variant of dlclose() Valgrind suppressions to fix a Buildbot issue.
modified:
mysql-test/valgrind.supp
=== modified file 'mysql-test/valgrind.supp'
--- a/mysql-test/valgrind.supp 2009-08-04 14:09:08 +0000
+++ b/mysql-test/valgrind.supp 2009-08-05 07:21:37 +0000
@@ -415,20 +415,6 @@
}
{
- dlclose memory loss from plugin variant 4, seen on Ubuntu Jaunty i686
- Memcheck:Leak
- fun:malloc
- fun:_dl_close_worker
- fun:_dl_close
- fun:dlclose_doit
- fun:_dl_catch_error
- fun:_dlerror_run
- fun:dlclose
- fun:_Z15free_plugin_memP12st_plugin_dl
- fun:_Z13plugin_dl_delPK19st_mysql_lex_string
-}
-
-{
dlclose memory loss from plugin variant 4
Memcheck:Leak
fun:malloc
@@ -455,6 +441,35 @@
}
{
+ dlclose memory loss from plugin variant 6, seen on Ubuntu Jaunty i686
+ Memcheck:Leak
+ fun:malloc
+ fun:_dl_scope_free
+ fun:_dl_close_worker
+ fun:_dl_close
+ fun:dlclose_doit
+ fun:_dl_catch_error
+ fun:_dlerror_run
+ fun:dlclose
+ fun:_ZL15free_plugin_memP12st_plugin_dl
+ fun:_ZL13plugin_dl_delPK19st_mysql_lex_string
+}
+
+{
+ dlclose memory loss from plugin variant 7, seen on Ubuntu Jaunty i686
+ Memcheck:Leak
+ fun:malloc
+ fun:_dl_close_worker
+ fun:_dl_close
+ fun:dlclose_doit
+ fun:_dl_catch_error
+ fun:_dlerror_run
+ fun:dlclose
+ fun:_ZL15free_plugin_memP12st_plugin_dl
+ fun:_ZL13plugin_dl_delPK19st_mysql_lex_string
+}
+
+{
dlopen / ptread_cancel_init memory loss on Suse Linux 10.3 32/64 bit ver 1
Memcheck:Leak
fun:*alloc
Just a remark about adding Valgrind suppressions for problems in system
libraries that we cannot fix.
I see in mysql-test/valgrind.supp a number of suppressions like this:
{
dlclose memory loss from plugin variant 4
Memcheck:Leak
fun:malloc
obj:/lib*/ld-*.so
obj:/lib*/ld-*.so
obj:/lib*/ld-*.so
obj:/lib*/libdl-*.so
fun:dlclose
fun:_ZL15free_plugin_memP12st_plugin_dl
fun:_ZL13plugin_dl_delPK19st_mysql_lex_string
}
where these "obj:/lib*/ld-*.so" entries are caused by lack of debugging
information, so Valgrind cannot provide proper stack traces.
Please do not add any more suppressions like this. Instead, install debugging
versions of the system libraries (this can be done without any loss of
efficiency, as the debug libraries are only used when explicitly requested,
e.g. by `mtr --valgrind`).
On Ubuntu/Debian:
sudo apt-get install libc6-dbg
On Suse:
Enable the `debuginfo' repository (if not already enabled)
Install the package `glibc-debuginfo'
Then the needed suppression will look much nicer:
{
dlclose memory loss from plugin variant 2
Memcheck:Leak
fun:malloc
fun:_dl_close_worker
fun:_dl_close
fun:_dl_catch_error
fun:_dlerror_run
fun:dlclose
fun:_ZL15free_plugin_memP12st_plugin_dl
fun:_ZL13plugin_dl_delPK19st_mysql_lex_string
}
This way, we get fewer suppressions to maintain, and those we do need to
maintain are much easier to read.
- Kristian.

[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2718: Add a new variant of dlclose() Valgrind suppressions to fix a Buildbot issue.
by noreply@launchpad.net 04 Aug '09
------------------------------------------------------------
revno: 2718
committer: knielsen(a)knielsen-hq.org
branch nick: mariadb-5.1
timestamp: Tue 2009-08-04 16:09:08 +0200
message:
Add a new variant of dlclose() Valgrind suppressions to fix a Buildbot issue.
modified:
mysql-test/valgrind.supp
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1

[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (knielsen:2718)
by knielsen@knielsen-hq.org 04 Aug '09
#At lp:maria
2718 knielsen(a)knielsen-hq.org 2009-08-04
Add a new variant of dlclose() Valgrind suppressions to fix a Buildbot issue.
modified:
mysql-test/valgrind.supp
=== modified file 'mysql-test/valgrind.supp'
--- a/mysql-test/valgrind.supp 2009-06-05 20:46:23 +0000
+++ b/mysql-test/valgrind.supp 2009-08-04 14:09:08 +0000
@@ -415,6 +415,20 @@
}
{
+ dlclose memory loss from plugin variant 4, seen on Ubuntu Jaunty i686
+ Memcheck:Leak
+ fun:malloc
+ fun:_dl_close_worker
+ fun:_dl_close
+ fun:dlclose_doit
+ fun:_dl_catch_error
+ fun:_dlerror_run
+ fun:dlclose
+ fun:_Z15free_plugin_memP12st_plugin_dl
+ fun:_Z13plugin_dl_delPK19st_mysql_lex_string
+}
+
+{
dlclose memory loss from plugin variant 4
Memcheck:Leak
fun:malloc

[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2717: Merge XtraDB 6 with latest MariaDB 5.1
by noreply@launchpad.net 04 Aug '09
Merge authors:
akuzminsky <akuzminsky(a)localhost.localdomain>
akuzminsky <akuzminsky@sm1u02>
Kristian Nielsen (knielsen)
Vadim Tkachenko (vadim-tk)
------------------------------------------------------------
revno: 2717 [merge]
committer: knielsen(a)knielsen-hq.org
branch nick: mariadb-5.1
timestamp: Mon 2009-08-03 22:19:12 +0200
message:
Merge XtraDB 6 with latest MariaDB 5.1
removed:
mysql-test/include/ctype_innodb_like.inc
mysql-test/include/have_innodb.inc
mysql-test/include/innodb_trx_weight.inc
mysql-test/r/innodb-autoinc.result
mysql-test/r/innodb-lock.result
mysql-test/r/innodb-replace.result
mysql-test/r/innodb-semi-consistent.result
mysql-test/r/innodb.result
mysql-test/r/innodb_bug34053.result
mysql-test/r/innodb_bug34300.result
mysql-test/r/innodb_bug35220.result
mysql-test/r/innodb_trx_weight.result
mysql-test/t/innodb-autoinc.test
mysql-test/t/innodb-lock.test
mysql-test/t/innodb-master.opt
mysql-test/t/innodb-replace.test
mysql-test/t/innodb-semi-consistent-master.opt
mysql-test/t/innodb-semi-consistent.test
mysql-test/t/innodb.test
mysql-test/t/innodb_bug34053.test
mysql-test/t/innodb_bug34300.test
mysql-test/t/innodb_bug35220.test
mysql-test/t/innodb_trx_weight.test
added:
BUILD/compile-innodb
BUILD/compile-innodb-debug
mysql-test/include/ctype_innodb_like.inc
mysql-test/include/have_innodb.inc
mysql-test/include/innodb-index.inc
mysql-test/include/innodb_trx_weight.inc
mysql-test/r/innodb-analyze.result
mysql-test/r/innodb-autoinc.result
mysql-test/r/innodb-index.result
mysql-test/r/innodb-index_ucs2.result
mysql-test/r/innodb-lock.result
mysql-test/r/innodb-replace.result
mysql-test/r/innodb-semi-consistent.result
mysql-test/r/innodb-timeout.result
mysql-test/r/innodb-use-sys-malloc.result
mysql-test/r/innodb-zip.result
mysql-test/r/innodb.result
mysql-test/r/innodb_bug34053.result
mysql-test/r/innodb_bug34300.result
mysql-test/r/innodb_bug35220.result
mysql-test/r/innodb_bug36169.result
mysql-test/r/innodb_bug36172.result
mysql-test/r/innodb_bug40360.result
mysql-test/r/innodb_bug41904.result
mysql-test/r/innodb_information_schema.result
mysql-test/r/innodb_trx_weight.result
mysql-test/r/innodb_xtradb_bug317074.result
mysql-test/t/innodb-analyze.test
mysql-test/t/innodb-autoinc.test
mysql-test/t/innodb-index.test
mysql-test/t/innodb-index_ucs2.test
mysql-test/t/innodb-lock.test
mysql-test/t/innodb-master.opt
mysql-test/t/innodb-replace.test
mysql-test/t/innodb-semi-consistent-master.opt
mysql-test/t/innodb-semi-consistent.test
mysql-test/t/innodb-timeout.test
mysql-test/t/innodb-use-sys-malloc-master.opt
mysql-test/t/innodb-use-sys-malloc.test
mysql-test/t/innodb-zip.test
mysql-test/t/innodb.test
mysql-test/t/innodb_bug34053.test
mysql-test/t/innodb_bug34300.test
mysql-test/t/innodb_bug35220.test
mysql-test/t/innodb_bug36169.test
mysql-test/t/innodb_bug36172.test
mysql-test/t/innodb_bug40360.test
mysql-test/t/innodb_bug41904.test
mysql-test/t/innodb_information_schema.test
mysql-test/t/innodb_trx_weight.test
mysql-test/t/innodb_xtradb_bug317074.test
storage/xtradb/
storage/xtradb/CMakeLists.txt
storage/xtradb/COPYING.Google
storage/xtradb/ChangeLog
storage/xtradb/Makefile.am
storage/xtradb/btr/
storage/xtradb/btr/btr0btr.c
storage/xtradb/btr/btr0cur.c
storage/xtradb/btr/btr0pcur.c
storage/xtradb/btr/btr0sea.c
storage/xtradb/buf/
storage/xtradb/buf/buf0buddy.c
storage/xtradb/buf/buf0buf.c
storage/xtradb/buf/buf0flu.c
storage/xtradb/buf/buf0lru.c
storage/xtradb/buf/buf0rea.c
storage/xtradb/data/
storage/xtradb/data/data0data.c
storage/xtradb/data/data0type.c
storage/xtradb/dict/
storage/xtradb/dict/dict0boot.c
storage/xtradb/dict/dict0crea.c
storage/xtradb/dict/dict0dict.c
storage/xtradb/dict/dict0load.c
storage/xtradb/dict/dict0mem.c
storage/xtradb/dyn/
storage/xtradb/dyn/dyn0dyn.c
storage/xtradb/eval/
storage/xtradb/eval/eval0eval.c
storage/xtradb/eval/eval0proc.c
storage/xtradb/fil/
storage/xtradb/fil/fil0fil.c
storage/xtradb/fsp/
storage/xtradb/fsp/fsp0fsp.c
storage/xtradb/fut/
storage/xtradb/fut/fut0fut.c
storage/xtradb/fut/fut0lst.c
storage/xtradb/ha/
storage/xtradb/ha/ha0ha.c
storage/xtradb/ha/ha0storage.c
storage/xtradb/ha/hash0hash.c
storage/xtradb/ha_innodb.def
storage/xtradb/handler/
storage/xtradb/handler/ha_innodb.cc
storage/xtradb/handler/ha_innodb.h
storage/xtradb/handler/handler0alter.cc
storage/xtradb/handler/handler0vars.h
storage/xtradb/handler/i_s.cc
storage/xtradb/handler/i_s.h
storage/xtradb/handler/innodb_patch_info.h
storage/xtradb/handler/mysql_addons.cc
storage/xtradb/handler/win_delay_loader.cc
storage/xtradb/ibuf/
storage/xtradb/ibuf/ibuf0ibuf.c
storage/xtradb/include/
storage/xtradb/include/btr0btr.h
storage/xtradb/include/btr0btr.ic
storage/xtradb/include/btr0cur.h
storage/xtradb/include/btr0cur.ic
storage/xtradb/include/btr0pcur.h
storage/xtradb/include/btr0pcur.ic
storage/xtradb/include/btr0sea.h
storage/xtradb/include/btr0sea.ic
storage/xtradb/include/btr0types.h
storage/xtradb/include/buf0buddy.h
storage/xtradb/include/buf0buddy.ic
storage/xtradb/include/buf0buf.h
storage/xtradb/include/buf0buf.ic
storage/xtradb/include/buf0flu.h
storage/xtradb/include/buf0flu.ic
storage/xtradb/include/buf0lru.h
storage/xtradb/include/buf0lru.ic
storage/xtradb/include/buf0rea.h
storage/xtradb/include/buf0types.h
storage/xtradb/include/data0data.h
storage/xtradb/include/data0data.ic
storage/xtradb/include/data0type.h
storage/xtradb/include/data0type.ic
storage/xtradb/include/data0types.h
storage/xtradb/include/db0err.h
storage/xtradb/include/dict0boot.h
storage/xtradb/include/dict0boot.ic
storage/xtradb/include/dict0crea.h
storage/xtradb/include/dict0crea.ic
storage/xtradb/include/dict0dict.h
storage/xtradb/include/dict0dict.ic
storage/xtradb/include/dict0load.h
storage/xtradb/include/dict0load.ic
storage/xtradb/include/dict0mem.h
storage/xtradb/include/dict0mem.ic
storage/xtradb/include/dict0types.h
storage/xtradb/include/dyn0dyn.h
storage/xtradb/include/dyn0dyn.ic
storage/xtradb/include/eval0eval.h
storage/xtradb/include/eval0eval.ic
storage/xtradb/include/eval0proc.h
storage/xtradb/include/eval0proc.ic
storage/xtradb/include/fil0fil.h
storage/xtradb/include/fsp0fsp.h
storage/xtradb/include/fsp0fsp.ic
storage/xtradb/include/fut0fut.h
storage/xtradb/include/fut0fut.ic
storage/xtradb/include/fut0lst.h
storage/xtradb/include/fut0lst.ic
storage/xtradb/include/ha0ha.h
storage/xtradb/include/ha0ha.ic
storage/xtradb/include/ha0storage.h
storage/xtradb/include/ha0storage.ic
storage/xtradb/include/ha_prototypes.h
storage/xtradb/include/handler0alter.h
storage/xtradb/include/hash0hash.h
storage/xtradb/include/hash0hash.ic
storage/xtradb/include/ibuf0ibuf.h
storage/xtradb/include/ibuf0ibuf.ic
storage/xtradb/include/ibuf0types.h
storage/xtradb/include/lock0iter.h
storage/xtradb/include/lock0lock.h
storage/xtradb/include/lock0lock.ic
storage/xtradb/include/lock0priv.h
storage/xtradb/include/lock0priv.ic
storage/xtradb/include/lock0types.h
storage/xtradb/include/log0log.h
storage/xtradb/include/log0log.ic
storage/xtradb/include/log0recv.h
storage/xtradb/include/log0recv.ic
storage/xtradb/include/mach0data.h
storage/xtradb/include/mach0data.ic
storage/xtradb/include/mem0dbg.h
storage/xtradb/include/mem0dbg.ic
storage/xtradb/include/mem0mem.h
storage/xtradb/include/mem0mem.ic
storage/xtradb/include/mem0pool.h
storage/xtradb/include/mem0pool.ic
storage/xtradb/include/mtr0log.h
storage/xtradb/include/mtr0log.ic
storage/xtradb/include/mtr0mtr.h
storage/xtradb/include/mtr0mtr.ic
storage/xtradb/include/mtr0types.h
storage/xtradb/include/mysql_addons.h
storage/xtradb/include/os0file.h
storage/xtradb/include/os0proc.h
storage/xtradb/include/os0proc.ic
storage/xtradb/include/os0sync.h
storage/xtradb/include/os0sync.ic
storage/xtradb/include/os0thread.h
storage/xtradb/include/os0thread.ic
storage/xtradb/include/page0cur.h
storage/xtradb/include/page0cur.ic
storage/xtradb/include/page0page.h
storage/xtradb/include/page0page.ic
storage/xtradb/include/page0types.h
storage/xtradb/include/page0zip.h
storage/xtradb/include/page0zip.ic
storage/xtradb/include/pars0grm.h
storage/xtradb/include/pars0opt.h
storage/xtradb/include/pars0opt.ic
storage/xtradb/include/pars0pars.h
storage/xtradb/include/pars0pars.ic
storage/xtradb/include/pars0sym.h
storage/xtradb/include/pars0sym.ic
storage/xtradb/include/pars0types.h
storage/xtradb/include/que0que.h
storage/xtradb/include/que0que.ic
storage/xtradb/include/que0types.h
storage/xtradb/include/read0read.h
storage/xtradb/include/read0read.ic
storage/xtradb/include/read0types.h
storage/xtradb/include/rem0cmp.h
storage/xtradb/include/rem0cmp.ic
storage/xtradb/include/rem0rec.h
storage/xtradb/include/rem0rec.ic
storage/xtradb/include/rem0types.h
storage/xtradb/include/row0ext.h
storage/xtradb/include/row0ext.ic
storage/xtradb/include/row0ins.h
storage/xtradb/include/row0ins.ic
storage/xtradb/include/row0merge.h
storage/xtradb/include/row0mysql.h
storage/xtradb/include/row0mysql.ic
storage/xtradb/include/row0purge.h
storage/xtradb/include/row0purge.ic
storage/xtradb/include/row0row.h
storage/xtradb/include/row0row.ic
storage/xtradb/include/row0sel.h
storage/xtradb/include/row0sel.ic
storage/xtradb/include/row0types.h
storage/xtradb/include/row0uins.h
storage/xtradb/include/row0uins.ic
storage/xtradb/include/row0umod.h
storage/xtradb/include/row0umod.ic
storage/xtradb/include/row0undo.h
storage/xtradb/include/row0undo.ic
storage/xtradb/include/row0upd.h
storage/xtradb/include/row0upd.ic
storage/xtradb/include/row0vers.h
storage/xtradb/include/row0vers.ic
storage/xtradb/include/srv0que.h
storage/xtradb/include/srv0srv.h
storage/xtradb/include/srv0srv.ic
storage/xtradb/include/srv0start.h
storage/xtradb/include/sync0arr.h
storage/xtradb/include/sync0arr.ic
storage/xtradb/include/sync0rw.h
storage/xtradb/include/sync0rw.ic
storage/xtradb/include/sync0sync.h
storage/xtradb/include/sync0sync.ic
storage/xtradb/include/sync0types.h
storage/xtradb/include/thr0loc.h
storage/xtradb/include/thr0loc.ic
storage/xtradb/include/trx0i_s.h
storage/xtradb/include/trx0purge.h
storage/xtradb/include/trx0purge.ic
storage/xtradb/include/trx0rec.h
storage/xtradb/include/trx0rec.ic
storage/xtradb/include/trx0roll.h
storage/xtradb/include/trx0roll.ic
storage/xtradb/include/trx0rseg.h
storage/xtradb/include/trx0rseg.ic
storage/xtradb/include/trx0sys.h
storage/xtradb/include/trx0sys.ic
storage/xtradb/include/trx0trx.h
storage/xtradb/include/trx0trx.ic
storage/xtradb/include/trx0types.h
storage/xtradb/include/trx0undo.h
storage/xtradb/include/trx0undo.ic
storage/xtradb/include/trx0xa.h
storage/xtradb/include/univ.i
storage/xtradb/include/usr0sess.h
storage/xtradb/include/usr0sess.ic
storage/xtradb/include/usr0types.h
storage/xtradb/include/ut0auxconf.h
storage/xtradb/include/ut0byte.h
storage/xtradb/include/ut0byte.ic
storage/xtradb/include/ut0dbg.h
storage/xtradb/include/ut0list.h
storage/xtradb/include/ut0list.ic
storage/xtradb/include/ut0lst.h
storage/xtradb/include/ut0mem.h
storage/xtradb/include/ut0mem.ic
storage/xtradb/include/ut0rnd.h
storage/xtradb/include/ut0rnd.ic
storage/xtradb/include/ut0sort.h
storage/xtradb/include/ut0ut.h
storage/xtradb/include/ut0ut.ic
storage/xtradb/include/ut0vec.h
storage/xtradb/include/ut0vec.ic
storage/xtradb/include/ut0wqueue.h
storage/xtradb/lock/
storage/xtradb/lock/lock0iter.c
storage/xtradb/lock/lock0lock.c
storage/xtradb/log/
storage/xtradb/log/log0log.c
storage/xtradb/log/log0recv.c
storage/xtradb/mach/
storage/xtradb/mach/mach0data.c
storage/xtradb/mem/
storage/xtradb/mem/mem0dbg.c
storage/xtradb/mem/mem0mem.c
storage/xtradb/mem/mem0pool.c
storage/xtradb/mtr/
storage/xtradb/mtr/mtr0log.c
storage/xtradb/mtr/mtr0mtr.c
storage/xtradb/os/
storage/xtradb/os/os0file.c
storage/xtradb/os/os0proc.c
storage/xtradb/os/os0sync.c
storage/xtradb/os/os0thread.c
storage/xtradb/page/
storage/xtradb/page/page0cur.c
storage/xtradb/page/page0page.c
storage/xtradb/page/page0zip.c
storage/xtradb/pars/
storage/xtradb/pars/lexyy.c
storage/xtradb/pars/make_bison.sh
storage/xtradb/pars/make_flex.sh
storage/xtradb/pars/pars0grm.c
storage/xtradb/pars/pars0grm.y
storage/xtradb/pars/pars0lex.l
storage/xtradb/pars/pars0opt.c
storage/xtradb/pars/pars0pars.c
storage/xtradb/pars/pars0sym.c
storage/xtradb/plug.in
storage/xtradb/que/
storage/xtradb/que/que0que.c
storage/xtradb/read/
storage/xtradb/read/read0read.c
storage/xtradb/rem/
storage/xtradb/rem/rem0cmp.c
storage/xtradb/rem/rem0rec.c
storage/xtradb/row/
storage/xtradb/row/row0ext.c
storage/xtradb/row/row0ins.c
storage/xtradb/row/row0merge.c
storage/xtradb/row/row0mysql.c
storage/xtradb/row/row0purge.c
storage/xtradb/row/row0row.c
storage/xtradb/row/row0sel.c
storage/xtradb/row/row0uins.c
storage/xtradb/row/row0umod.c
storage/xtradb/row/row0undo.c
storage/xtradb/row/row0upd.c
storage/xtradb/row/row0vers.c
storage/xtradb/scripts/
storage/xtradb/scripts/install_innodb_plugins.sql
storage/xtradb/scripts/install_innodb_plugins_win.sql
storage/xtradb/srv/
storage/xtradb/srv/srv0que.c
storage/xtradb/srv/srv0srv.c
storage/xtradb/srv/srv0start.c
storage/xtradb/sync/
storage/xtradb/sync/sync0arr.c
storage/xtradb/sync/sync0rw.c
storage/xtradb/sync/sync0sync.c
storage/xtradb/thr/
storage/xtradb/thr/thr0loc.c
storage/xtradb/trx/
storage/xtradb/trx/trx0i_s.c
storage/xtradb/trx/trx0purge.c
storage/xtradb/trx/trx0rec.c
storage/xtradb/trx/trx0roll.c
storage/xtradb/trx/trx0rseg.c
storage/xtradb/trx/trx0sys.c
storage/xtradb/trx/trx0trx.c
storage/xtradb/trx/trx0undo.c
storage/xtradb/usr/
storage/xtradb/usr/usr0sess.c
storage/xtradb/ut/
storage/xtradb/ut/ut0auxconf.c
storage/xtradb/ut/ut0byte.c
storage/xtradb/ut/ut0dbg.c
storage/xtradb/ut/ut0list.c
storage/xtradb/ut/ut0mem.c
storage/xtradb/ut/ut0rnd.c
storage/xtradb/ut/ut0ut.c
storage/xtradb/ut/ut0vec.c
storage/xtradb/ut/ut0wqueue.c
storage/xtradb/win-plugin/
storage/xtradb/win-plugin/README
storage/xtradb/win-plugin/win-plugin.diff
strings/strmov_overlapp.c
renamed:
storage/innobase/plug.in => storage/innobase/plug.in.disabled
modified:
.bzrignore
CMakeLists.txt
configure.in
include/atomic/nolock.h
include/m_string.h
include/my_sys.h
libmysql/Makefile.shared
libmysqld/CMakeLists.txt
mysql-test/include/mtr_check.sql
mysql-test/include/varchar.inc
mysql-test/lib/mtr_cases.pm
mysql-test/mysql-test-run.pl
mysql-test/r/events_stress.result
mysql-test/r/index_merge_innodb.result
mysql-test/r/information_schema.result
mysql-test/r/information_schema_all_engines.result
mysql-test/r/mysqlbinlog_row_big.result
mysql-test/r/row-checksum-old.result
mysql-test/r/row-checksum.result
mysql-test/r/variables-big.result
mysql-test/t/events_stress.test
mysql-test/t/information_schema.test
mysql-test/t/mysqlbinlog_row_big.test
mysql-test/t/partition_innodb.test
mysql-test/t/type_bit_innodb.test
mysql-test/t/variables-big.test
mysys/mf_iocache2.c
mysys/thr_mutex.c
sql-common/client.c
sql/log_event.cc
sql/log_event.h
sql/rpl_mi.cc
sql/rpl_rli.cc
sql/slave.cc
sql/sql_table.cc
strings/Makefile.am
The size of the diff (225341 lines) is larger than your specified limit of 1000 lines
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1
Your team Maria developers is subscribed to branch lp:maria.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1/+edit-subscription.
Hi Sergey,
I was looking at the many failures we have on the jaunty-x86-valgrind host in
Buildbot ...
I was wondering if you could install the `libc6-dbg' package?
sudo apt-get install libc6-dbg
Assuming it works as in Ubuntu Hardy, this should give proper symbols in the
Valgrind stack traces, so that the bogus warnings from libc internals are
suppressed (or we can add meaningful suppressions for any remaining problems).
- Kristian.
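
(For anyone adding suppressions by hand: a suppression entry of the sort
Valgrind emits with --gen-suppressions=all looks roughly like the sketch
below. The name and frame entries here are only illustrative, not taken
from an actual Buildbot trace; the real frames come from the warning
being suppressed.)

    {
       libc-internal-cond-example
       Memcheck:Cond
       fun:__strlen_sse2
       obj:/lib/libc-2.9.so
    }

Such entries go into the suppression file passed to Valgrind via
--suppressions=FILE (mysql-test-run.pl maintains one for the test suite).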

[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (knielsen:2717)
by knielsen@knielsen-hq.org 03 Aug '09
#At lp:maria
2717 knielsen(a)knielsen-hq.org 2009-08-03 [merge]
Merge XtraDB 6 into MariaDB.
modified:
mysql-test/r/events_stress.result
mysql-test/r/information_schema.result
mysql-test/r/information_schema_all_engines.result
mysql-test/r/innodb_bug36169.result
mysql-test/r/innodb_xtradb_bug317074.result
mysql-test/t/events_stress.test
mysql-test/t/innodb-analyze.test
mysql-test/t/innodb_bug36169.test
mysql-test/t/innodb_bug36172.test
mysql-test/t/innodb_xtradb_bug317074.test
storage/xtradb/btr/btr0cur.c
storage/xtradb/btr/btr0sea.c
storage/xtradb/buf/buf0buddy.c
storage/xtradb/buf/buf0buf.c
storage/xtradb/buf/buf0flu.c
storage/xtradb/buf/buf0lru.c
storage/xtradb/buf/buf0rea.c
storage/xtradb/dict/dict0boot.c
storage/xtradb/dict/dict0crea.c
storage/xtradb/dict/dict0dict.c
storage/xtradb/dict/dict0load.c
storage/xtradb/fil/fil0fil.c
storage/xtradb/handler/ha_innodb.cc
storage/xtradb/handler/i_s.cc
storage/xtradb/handler/i_s.h
storage/xtradb/handler/innodb_patch_info.h
storage/xtradb/ibuf/ibuf0ibuf.c
storage/xtradb/include/buf0buddy.h
storage/xtradb/include/buf0buddy.ic
storage/xtradb/include/buf0buf.h
storage/xtradb/include/buf0buf.ic
storage/xtradb/include/buf0flu.ic
storage/xtradb/include/buf0lru.h
storage/xtradb/include/dict0dict.h
storage/xtradb/include/dict0dict.ic
storage/xtradb/include/log0log.h
storage/xtradb/include/rem0cmp.h
storage/xtradb/include/rem0cmp.ic
storage/xtradb/include/srv0srv.h
storage/xtradb/include/sync0sync.h
storage/xtradb/include/univ.i
storage/xtradb/include/ut0auxconf.h
storage/xtradb/log/log0log.c
storage/xtradb/log/log0recv.c
storage/xtradb/mtr/mtr0mtr.c
storage/xtradb/os/os0file.c
storage/xtradb/rem/rem0cmp.c
storage/xtradb/row/row0mysql.c
storage/xtradb/scripts/install_innodb_plugins.sql
storage/xtradb/srv/srv0srv.c
storage/xtradb/srv/srv0start.c
storage/xtradb/sync/sync0sync.c
storage/xtradb/ut/ut0ut.c
=== modified file 'mysql-test/r/events_stress.result'
--- a/mysql-test/r/events_stress.result 2006-09-01 11:08:44 +0000
+++ b/mysql-test/r/events_stress.result 2009-08-03 20:09:53 +0000
@@ -32,6 +32,7 @@ USE events_conn1_test2;
SELECT COUNT(*) FROM INFORMATION_SCHEMA.EVENTS WHERE EVENT_SCHEMA='events_conn1_test2';
COUNT(*)
50
+SET @old_event_scheduler=@@event_scheduler;
SET GLOBAL event_scheduler=on;
DROP DATABASE events_conn1_test2;
SET GLOBAL event_scheduler=off;
@@ -63,3 +64,4 @@ DROP TABLE fill_it1;
DROP TABLE fill_it2;
DROP TABLE fill_it3;
DROP DATABASE events_test;
+SET GLOBAL event_scheduler=@old_event_scheduler;
=== modified file 'mysql-test/r/information_schema.result'
--- a/mysql-test/r/information_schema.result 2009-06-11 17:49:51 +0000
+++ b/mysql-test/r/information_schema.result 2009-08-03 20:09:53 +0000
@@ -61,9 +61,11 @@ INNODB_CMP
INNODB_CMPMEM
INNODB_CMPMEM_RESET
INNODB_CMP_RESET
+INNODB_INDEX_STATS
INNODB_LOCKS
INNODB_LOCK_WAITS
INNODB_RSEG
+INNODB_TABLE_STATS
INNODB_TRX
KEY_COLUMN_USAGE
PARTITIONS
@@ -863,6 +865,8 @@ TABLE_CONSTRAINTS TABLE_NAME select
TABLE_PRIVILEGES TABLE_NAME select
VIEWS TABLE_NAME select
INNODB_BUFFER_POOL_PAGES_INDEX table_name select
+INNODB_INDEX_STATS table_name select
+INNODB_TABLE_STATS table_name select
delete from mysql.user where user='mysqltest_4';
delete from mysql.db where user='mysqltest_4';
flush privileges;
=== modified file 'mysql-test/r/information_schema_all_engines.result'
--- a/mysql-test/r/information_schema_all_engines.result 2009-06-11 12:53:26 +0000
+++ b/mysql-test/r/information_schema_all_engines.result 2009-08-03 20:09:53 +0000
@@ -35,13 +35,15 @@ INNODB_CMP
INNODB_RSEG
XTRADB_ENHANCEMENTS
INNODB_BUFFER_POOL_PAGES_INDEX
-INNODB_BUFFER_POOL_PAGES_BLOB
+INNODB_INDEX_STATS
INNODB_TRX
INNODB_CMP_RESET
INNODB_LOCK_WAITS
INNODB_CMPMEM_RESET
INNODB_LOCKS
INNODB_CMPMEM
+INNODB_TABLE_STATS
+INNODB_BUFFER_POOL_PAGES_BLOB
SELECT t.table_name, c1.column_name
FROM information_schema.tables t
INNER JOIN
@@ -91,13 +93,15 @@ INNODB_CMP page_size
INNODB_RSEG rseg_id
XTRADB_ENHANCEMENTS name
INNODB_BUFFER_POOL_PAGES_INDEX schema_name
-INNODB_BUFFER_POOL_PAGES_BLOB space_id
+INNODB_INDEX_STATS table_name
INNODB_TRX trx_id
INNODB_CMP_RESET page_size
INNODB_LOCK_WAITS requesting_trx_id
INNODB_CMPMEM_RESET page_size
INNODB_LOCKS lock_id
INNODB_CMPMEM page_size
+INNODB_TABLE_STATS table_name
+INNODB_BUFFER_POOL_PAGES_BLOB space_id
SELECT t.table_name, c1.column_name
FROM information_schema.tables t
INNER JOIN
@@ -147,13 +151,15 @@ INNODB_CMP page_size
INNODB_RSEG rseg_id
XTRADB_ENHANCEMENTS name
INNODB_BUFFER_POOL_PAGES_INDEX schema_name
-INNODB_BUFFER_POOL_PAGES_BLOB space_id
+INNODB_INDEX_STATS table_name
INNODB_TRX trx_id
INNODB_CMP_RESET page_size
INNODB_LOCK_WAITS requesting_trx_id
INNODB_CMPMEM_RESET page_size
INNODB_LOCKS lock_id
INNODB_CMPMEM page_size
+INNODB_TABLE_STATS table_name
+INNODB_BUFFER_POOL_PAGES_BLOB space_id
select 1 as f1 from information_schema.tables where "CHARACTER_SETS"=
(select cast(table_name as char) from information_schema.tables
order by table_name limit 1) limit 1;
@@ -192,9 +198,11 @@ INNODB_CMP information_schema.INNODB_CMP
INNODB_CMPMEM information_schema.INNODB_CMPMEM 1
INNODB_CMPMEM_RESET information_schema.INNODB_CMPMEM_RESET 1
INNODB_CMP_RESET information_schema.INNODB_CMP_RESET 1
+INNODB_INDEX_STATS information_schema.INNODB_INDEX_STATS 1
INNODB_LOCKS information_schema.INNODB_LOCKS 1
INNODB_LOCK_WAITS information_schema.INNODB_LOCK_WAITS 1
INNODB_RSEG information_schema.INNODB_RSEG 1
+INNODB_TABLE_STATS information_schema.INNODB_TABLE_STATS 1
INNODB_TRX information_schema.INNODB_TRX 1
KEY_COLUMN_USAGE information_schema.KEY_COLUMN_USAGE 1
PARTITIONS information_schema.PARTITIONS 1
@@ -254,13 +262,15 @@ Database: information_schema
| INNODB_RSEG |
| XTRADB_ENHANCEMENTS |
| INNODB_BUFFER_POOL_PAGES_INDEX |
-| INNODB_BUFFER_POOL_PAGES_BLOB |
+| INNODB_INDEX_STATS |
| INNODB_TRX |
| INNODB_CMP_RESET |
| INNODB_LOCK_WAITS |
| INNODB_CMPMEM_RESET |
| INNODB_LOCKS |
| INNODB_CMPMEM |
+| INNODB_TABLE_STATS |
+| INNODB_BUFFER_POOL_PAGES_BLOB |
+---------------------------------------+
Database: INFORMATION_SCHEMA
+---------------------------------------+
@@ -300,13 +310,15 @@ Database: INFORMATION_SCHEMA
| INNODB_RSEG |
| XTRADB_ENHANCEMENTS |
| INNODB_BUFFER_POOL_PAGES_INDEX |
-| INNODB_BUFFER_POOL_PAGES_BLOB |
+| INNODB_INDEX_STATS |
| INNODB_TRX |
| INNODB_CMP_RESET |
| INNODB_LOCK_WAITS |
| INNODB_CMPMEM_RESET |
| INNODB_LOCKS |
| INNODB_CMPMEM |
+| INNODB_TABLE_STATS |
+| INNODB_BUFFER_POOL_PAGES_BLOB |
+---------------------------------------+
Wildcard: inf_rmation_schema
+--------------------+
@@ -316,5 +328,5 @@ Wildcard: inf_rmation_schema
+--------------------+
SELECT table_schema, count(*) FROM information_schema.TABLES WHERE table_schema IN ('mysql', 'INFORMATION_SCHEMA', 'test', 'mysqltest') AND table_name<>'ndb_binlog_index' AND table_name<>'ndb_apply_status' GROUP BY TABLE_SCHEMA;
table_schema count(*)
-information_schema 41
+information_schema 43
mysql 22
=== modified file 'mysql-test/r/innodb_bug36169.result'
--- a/mysql-test/r/innodb_bug36169.result 2009-06-11 12:53:26 +0000
+++ b/mysql-test/r/innodb_bug36169.result 2009-08-03 20:09:53 +0000
@@ -1,5 +1,5 @@
-SET @save_innodb_file_format=@@global.innodb_file_format;
-SET @save_innodb_file_format_check=@@global.innodb_file_format_check;
-SET @save_innodb_file_per_table=@@global.innodb_file_per_table;
+set @old_innodb_file_per_table=@@innodb_file_per_table;
+set @old_innodb_file_format=@@innodb_file_format;
+set @old_innodb_file_format_check=@@innodb_file_format_check;
SET GLOBAL innodb_file_format='Barracuda';
SET GLOBAL innodb_file_per_table=ON;
=== modified file 'mysql-test/r/innodb_xtradb_bug317074.result'
--- a/mysql-test/r/innodb_xtradb_bug317074.result 2009-06-11 12:53:26 +0000
+++ b/mysql-test/r/innodb_xtradb_bug317074.result 2009-08-03 20:09:53 +0000
@@ -1,5 +1,5 @@
-SET @save_innodb_file_format=@@global.innodb_file_format;
-SET @save_innodb_file_format_check=@@global.innodb_file_format_check;
-SET @save_innodb_file_per_table=@@global.innodb_file_per_table;
+SET @old_innodb_file_format=@@innodb_file_format;
+SET @old_innodb_file_per_table=@@innodb_file_per_table;
+SET @old_innodb_file_format_check=@@innodb_file_format_check;
SET GLOBAL innodb_file_format='Barracuda';
SET GLOBAL innodb_file_per_table=ON;
=== modified file 'mysql-test/t/events_stress.test'
--- a/mysql-test/t/events_stress.test 2007-05-26 14:36:38 +0000
+++ b/mysql-test/t/events_stress.test 2009-08-03 20:09:53 +0000
@@ -61,6 +61,7 @@ while ($1)
}
--enable_query_log
SELECT COUNT(*) FROM INFORMATION_SCHEMA.EVENTS WHERE EVENT_SCHEMA='events_conn1_test2';
+SET @old_event_scheduler=@@event_scheduler;
SET GLOBAL event_scheduler=on;
--sleep 2.5
DROP DATABASE events_conn1_test2;
@@ -135,3 +136,6 @@ DROP USER event_user3@localhost;
#
DROP DATABASE events_test;
+
+# Cleanup
+SET GLOBAL event_scheduler=@old_event_scheduler;
=== modified file 'mysql-test/t/innodb-analyze.test'
--- a/mysql-test/t/innodb-analyze.test 2009-06-09 15:08:46 +0000
+++ b/mysql-test/t/innodb-analyze.test 2009-08-03 20:09:53 +0000
@@ -11,7 +11,7 @@
-- disable_result_log
-- enable_warnings
-SET @save_innodb_stats_sample_pages=@@innodb_stats_sample_pages;
+SET @old_innodb_stats_sample_pages=@@innodb_stats_sample_pages;
SET GLOBAL innodb_stats_sample_pages=0;
# check that the value has been adjusted to 1
@@ -61,5 +61,5 @@ ANALYZE TABLE innodb_analyze;
SET GLOBAL innodb_stats_sample_pages=16;
ANALYZE TABLE innodb_analyze;
-SET GLOBAL innodb_stats_sample_pages=@save_innodb_stats_sample_pages;
DROP TABLE innodb_analyze;
+SET GLOBAL innodb_stats_sample_pages=@old_innodb_stats_sample_pages;
=== modified file 'mysql-test/t/innodb_bug36169.test'
--- a/mysql-test/t/innodb_bug36169.test 2009-06-11 12:53:26 +0000
+++ b/mysql-test/t/innodb_bug36169.test 2009-08-03 20:09:53 +0000
@@ -4,10 +4,10 @@
#
-- source include/have_innodb.inc
+set @old_innodb_file_per_table=@@innodb_file_per_table;
+set @old_innodb_file_format=@@innodb_file_format;
+set @old_innodb_file_format_check=@@innodb_file_format_check;
-SET @save_innodb_file_format=@@global.innodb_file_format;
-SET @save_innodb_file_format_check=@@global.innodb_file_format_check;
-SET @save_innodb_file_per_table=@@global.innodb_file_per_table;
SET GLOBAL innodb_file_format='Barracuda';
SET GLOBAL innodb_file_per_table=ON;
@@ -1148,10 +1148,6 @@ KEY `idx44` (`col176`(100),`col42`,`col7
KEY `idx45` (`col2`(27),`col27`(116))
)engine=innodb ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1;
-SET GLOBAL innodb_file_format=@save_innodb_file_format;
-SET GLOBAL innodb_file_format_check=@save_innodb_file_format_check;
-SET GLOBAL innodb_file_per_table=@save_innodb_file_per_table;
-
DROP TABLE IF EXISTS table0;
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
@@ -1160,3 +1156,7 @@ DROP TABLE IF EXISTS table4;
DROP TABLE IF EXISTS table5;
DROP TABLE IF EXISTS table6;
+set global innodb_file_per_table=@old_innodb_file_per_table;
+set global innodb_file_format=@old_innodb_file_format;
+set global innodb_file_format_check=@old_innodb_file_format_check;
+
=== modified file 'mysql-test/t/innodb_bug36172.test'
--- a/mysql-test/t/innodb_bug36172.test 2009-06-11 12:53:26 +0000
+++ b/mysql-test/t/innodb_bug36172.test 2009-08-03 20:09:53 +0000
@@ -13,10 +13,10 @@ SET storage_engine=InnoDB;
-- disable_query_log
-- disable_result_log
+set @old_innodb_file_per_table=@@innodb_file_per_table;
+set @old_innodb_file_format=@@innodb_file_format;
+set @old_innodb_file_format_check=@@innodb_file_format_check;
-SET @save_innodb_file_format=@@global.innodb_file_format;
-SET @save_innodb_file_format_check=@@global.innodb_file_format_check;
-SET @save_innodb_file_per_table=@@global.innodb_file_per_table;
SET GLOBAL innodb_file_format='Barracuda';
SET GLOBAL innodb_file_per_table=on;
@@ -27,7 +27,8 @@ CHECK TABLE table0 EXTENDED;
INSERT IGNORE INTO `table0` SET `col19` = '19940127002709', `col20` = 2383927.9055146948, `col21` = 4293243420.5621204000, `col22` = '20511211123705', `col23` = 4289899778.6573381000, `col24` = 4293449279.0540481000, `col25` = 'emphysemic', `col26` = 'dentally', `col27` = '2347406', `col28` = 'eruct', `col30` = 1222, `col31` = 4294372994.9941406000, `col32` = 4291385574.1173744000, `col33` = 'borrowing\'s', `col34` = 'septics', `col35` = 'ratter\'s', `col36` = 'Kaye', `col37` = 'Florentia', `col38` = 'allium', `col39` = 'barkeep', `col40` = '19510407003441', `col41` = 4293559200.4215522000, `col42` = 22482, `col43` = 'decussate', `col44` = 'Brom\'s', `col45` = 'violated', `col46` = 4925506.4635456400, `col47` = 930549, `col48` = '51296066', `col49` = 'voluminously', `col50` = '29306676', `col51` = -88, `col52` = -2153690, `col53` = 4290250202.1464887000, `col54` = 'expropriation', `col55` = 'Aberdeen\'s', `col56` = 20343, `col58` = '19640415171532', `col59` = 'extern', `col60` = 'Ubana', `col61` = 4290487961.8539081000, `col62` = '2147', `col63` = -24271, `col64` = '20750801194548', `col65` = 'Cunaxa\'s', `col66` = 'pasticcio', `col67` = 2795817, `col68` = 'Indore\'s', `col70` = 6864127, `col71` = '1817832', `col72` = '20540506114211', `col73` = '20040101012300', `col74` = 'rationalized', `col75` = '45522', `col76` = 'indene', `col77` = -6964559, `col78` = 4247535.5266884370, `col79` = '20720416124357', `col80` = '2143', `col81` = 4292060102.4466386000, `col82` = 'striving', `col83` = 'boneblack\'s', `col84` = 'redolent', `col85` = 6489697.9009369183, `col86` = 4287473465.9731131000, `col87` = 7726015, `col88` = 'perplexed', `col89` = '17153791', `col90` = 5478587.1108127078, `col91` = 4287091404.7004304000, `col92` = 'Boulez\'s', `col93` = '2931278';
CHECK TABLE table0 EXTENDED;
-SET GLOBAL innodb_file_format=@save_innodb_file_format;
-SET GLOBAL innodb_file_format_check=@save_innodb_file_format_check;
-SET GLOBAL innodb_file_per_table=@save_innodb_file_per_table;
DROP TABLE table0;
+set global innodb_file_per_table=@old_innodb_file_per_table;
+set global innodb_file_format=@old_innodb_file_format;
+set global innodb_file_format_check=@old_innodb_file_format_check;
+
=== modified file 'mysql-test/t/innodb_xtradb_bug317074.test'
--- a/mysql-test/t/innodb_xtradb_bug317074.test 2009-06-11 12:53:26 +0000
+++ b/mysql-test/t/innodb_xtradb_bug317074.test 2009-08-03 20:09:53 +0000
@@ -1,8 +1,8 @@
-- source include/have_innodb.inc
-SET @save_innodb_file_format=@@global.innodb_file_format;
-SET @save_innodb_file_format_check=@@global.innodb_file_format_check;
-SET @save_innodb_file_per_table=@@global.innodb_file_per_table;
+SET @old_innodb_file_format=@@innodb_file_format;
+SET @old_innodb_file_per_table=@@innodb_file_per_table;
+SET @old_innodb_file_format_check=@@innodb_file_format_check;
SET GLOBAL innodb_file_format='Barracuda';
SET GLOBAL innodb_file_per_table=ON;
@@ -38,8 +38,7 @@ DROP PROCEDURE insert_many;
# The bug is hangup at the following statement
ALTER TABLE test1 ENGINE=MyISAM;
-SET GLOBAL innodb_file_format=@save_innodb_file_format;
-SET GLOBAL innodb_file_format_check=@save_innodb_file_format_check;
-SET GLOBAL innodb_file_per_table=@save_innodb_file_per_table;
-
DROP TABLE test1;
+SET GLOBAL innodb_file_format=@old_innodb_file_format;
+SET GLOBAL innodb_file_per_table=@old_innodb_file_per_table;
+SET GLOBAL innodb_file_format_check=@old_innodb_file_format_check;
=== modified file 'storage/xtradb/btr/btr0cur.c'
--- a/storage/xtradb/btr/btr0cur.c 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/btr/btr0cur.c 2009-06-25 01:43:25 +0000
@@ -3202,7 +3202,9 @@ btr_estimate_number_of_different_key_val
ulint n_cols;
ulint matched_fields;
ulint matched_bytes;
+ ib_int64_t n_recs = 0;
ib_int64_t* n_diff;
+ ib_int64_t* n_not_nulls;
ullint n_sample_pages; /* number of pages to sample */
ulint not_empty_flag = 0;
ulint total_external_size = 0;
@@ -3215,6 +3217,7 @@ btr_estimate_number_of_different_key_val
ulint offsets_next_rec_[REC_OFFS_NORMAL_SIZE];
ulint* offsets_rec = offsets_rec_;
ulint* offsets_next_rec= offsets_next_rec_;
+ ulint stats_method = srv_stats_method;
rec_offs_init(offsets_rec_);
rec_offs_init(offsets_next_rec_);
@@ -3222,6 +3225,10 @@ btr_estimate_number_of_different_key_val
n_diff = mem_zalloc((n_cols + 1) * sizeof(ib_int64_t));
+ if (stats_method == SRV_STATS_METHOD_IGNORE_NULLS) {
+ n_not_nulls = mem_zalloc((n_cols + 1) * sizeof(ib_int64_t));
+ }
+
/* It makes no sense to test more pages than are contained
in the index, thus we lower the number if it is too high */
if (srv_stats_sample_pages > index->stat_index_size) {
@@ -3260,6 +3267,20 @@ btr_estimate_number_of_different_key_val
}
while (rec != supremum) {
+ /* count recs */
+ if (stats_method == SRV_STATS_METHOD_IGNORE_NULLS) {
+ n_recs++;
+ for (j = 0; j <= n_cols; j++) {
+ ulint f_len;
+ rec_get_nth_field(rec, offsets_rec,
+ j, &f_len);
+ if (f_len == UNIV_SQL_NULL)
+ break;
+
+ n_not_nulls[j]++;
+ }
+ }
+
rec_t* next_rec = page_rec_get_next(rec);
if (next_rec == supremum) {
break;
@@ -3274,7 +3295,7 @@ btr_estimate_number_of_different_key_val
cmp_rec_rec_with_match(rec, next_rec,
offsets_rec, offsets_next_rec,
index, &matched_fields,
- &matched_bytes);
+ &matched_bytes, srv_stats_method);
for (j = matched_fields + 1; j <= n_cols; j++) {
/* We add one if this index record has
@@ -3359,9 +3380,21 @@ btr_estimate_number_of_different_key_val
}
index->stat_n_diff_key_vals[j] += add_on;
+
+ /* revision for 'nulls_ignored' */
+ if (stats_method == SRV_STATS_METHOD_IGNORE_NULLS) {
+ if (!n_not_nulls[j])
+ n_not_nulls[j] = 1;
+ index->stat_n_diff_key_vals[j] =
+ index->stat_n_diff_key_vals[j] * n_recs
+ / n_not_nulls[j];
+ }
}
mem_free(n_diff);
+ if (stats_method == SRV_STATS_METHOD_IGNORE_NULLS) {
+ mem_free(n_not_nulls);
+ }
if (UNIV_LIKELY_NULL(heap)) {
mem_heap_free(heap);
}
@@ -3733,7 +3766,8 @@ btr_blob_free(
mtr_commit(mtr);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
mutex_enter(&block->mutex);
/* Only free the block if it is still allocated to
@@ -3744,17 +3778,22 @@ btr_blob_free(
&& buf_block_get_space(block) == space
&& buf_block_get_page_no(block) == page_no) {
- if (buf_LRU_free_block(&block->page, all, NULL)
+ if (buf_LRU_free_block(&block->page, all, NULL, TRUE)
!= BUF_LRU_FREED
- && all && block->page.zip.data) {
+ && all && block->page.zip.data
+ /* Now, buf_LRU_free_block() may release mutex temporarily */
+ && buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE
+ && buf_block_get_space(block) == space
+ && buf_block_get_page_no(block) == page_no) {
/* Attempt to deallocate the uncompressed page
if the whole block cannot be deallocted. */
- buf_LRU_free_block(&block->page, FALSE, NULL);
+ buf_LRU_free_block(&block->page, FALSE, NULL, TRUE);
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
mutex_exit(&block->mutex);
}
=== modified file 'storage/xtradb/btr/btr0sea.c'
--- a/storage/xtradb/btr/btr0sea.c 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/btr/btr0sea.c 2009-06-25 01:43:25 +0000
@@ -1731,7 +1731,8 @@ btr_search_validate(void)
rec_offs_init(offsets_);
rw_lock_x_lock(&btr_search_latch);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_x_lock(&page_hash_latch);
cell_count = hash_get_n_cells(btr_search_sys->hash_index);
@@ -1739,11 +1740,13 @@ btr_search_validate(void)
/* We release btr_search_latch every once in a while to
give other queries a chance to run. */
if ((i != 0) && ((i % chunk_size) == 0)) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_x_unlock(&page_hash_latch);
rw_lock_x_unlock(&btr_search_latch);
os_thread_yield();
rw_lock_x_lock(&btr_search_latch);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_x_lock(&page_hash_latch);
}
node = hash_get_nth_cell(btr_search_sys->hash_index, i)->node;
@@ -1850,11 +1853,13 @@ btr_search_validate(void)
/* We release btr_search_latch every once in a while to
give other queries a chance to run. */
if (i != 0) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_x_unlock(&page_hash_latch);
rw_lock_x_unlock(&btr_search_latch);
os_thread_yield();
rw_lock_x_lock(&btr_search_latch);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_x_lock(&page_hash_latch);
}
if (!ha_validate(btr_search_sys->hash_index, i, end_index)) {
@@ -1862,7 +1867,8 @@ btr_search_validate(void)
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_x_unlock(&page_hash_latch);
rw_lock_x_unlock(&btr_search_latch);
if (UNIV_LIKELY_NULL(heap)) {
mem_heap_free(heap);
=== modified file 'storage/xtradb/buf/buf0buddy.c'
--- a/storage/xtradb/buf/buf0buddy.c 2009-05-04 04:32:30 +0000
+++ b/storage/xtradb/buf/buf0buddy.c 2009-06-25 01:43:25 +0000
@@ -82,7 +82,7 @@ buf_buddy_add_to_free(
#endif /* UNIV_DEBUG_VALGRIND */
ut_ad(buf_pool->zip_free[i].start != bpage);
- UT_LIST_ADD_FIRST(list, buf_pool->zip_free[i], bpage);
+ UT_LIST_ADD_FIRST(zip_list, buf_pool->zip_free[i], bpage);
#ifdef UNIV_DEBUG_VALGRIND
if (b) UNIV_MEM_FREE(b, BUF_BUDDY_LOW << i);
@@ -100,8 +100,8 @@ buf_buddy_remove_from_free(
ulint i) /* in: index of buf_pool->zip_free[] */
{
#ifdef UNIV_DEBUG_VALGRIND
- buf_page_t* prev = UT_LIST_GET_PREV(list, bpage);
- buf_page_t* next = UT_LIST_GET_NEXT(list, bpage);
+ buf_page_t* prev = UT_LIST_GET_PREV(zip_list, bpage);
+ buf_page_t* next = UT_LIST_GET_NEXT(zip_list, bpage);
if (prev) UNIV_MEM_VALID(prev, BUF_BUDDY_LOW << i);
if (next) UNIV_MEM_VALID(next, BUF_BUDDY_LOW << i);
@@ -111,7 +111,7 @@ buf_buddy_remove_from_free(
#endif /* UNIV_DEBUG_VALGRIND */
ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_FREE);
- UT_LIST_REMOVE(list, buf_pool->zip_free[i], bpage);
+ UT_LIST_REMOVE(zip_list, buf_pool->zip_free[i], bpage);
#ifdef UNIV_DEBUG_VALGRIND
if (prev) UNIV_MEM_FREE(prev, BUF_BUDDY_LOW << i);
@@ -131,12 +131,13 @@ buf_buddy_alloc_zip(
{
buf_page_t* bpage;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&zip_free_mutex));
ut_a(i < BUF_BUDDY_SIZES);
#if defined UNIV_DEBUG && !defined UNIV_DEBUG_VALGRIND
/* Valgrind would complain about accessing free memory. */
- UT_LIST_VALIDATE(list, buf_page_t, buf_pool->zip_free[i]);
+ UT_LIST_VALIDATE(zip_list, buf_page_t, buf_pool->zip_free[i]);
#endif /* UNIV_DEBUG && !UNIV_DEBUG_VALGRIND */
bpage = UT_LIST_GET_LAST(buf_pool->zip_free[i]);
@@ -177,16 +178,19 @@ static
void
buf_buddy_block_free(
/*=================*/
- void* buf) /* in: buffer frame to deallocate */
+ void* buf, /* in: buffer frame to deallocate */
+ ibool have_page_hash_mutex)
{
const ulint fold = BUF_POOL_ZIP_FOLD_PTR(buf);
buf_page_t* bpage;
buf_block_t* block;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(!mutex_own(&buf_pool_zip_mutex));
ut_a(!ut_align_offset(buf, UNIV_PAGE_SIZE));
+ mutex_enter(&zip_hash_mutex);
+
HASH_SEARCH(hash, buf_pool->zip_hash, fold, buf_page_t*, bpage,
ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_MEMORY
&& bpage->in_zip_hash && !bpage->in_page_hash),
@@ -198,12 +202,14 @@ buf_buddy_block_free(
ut_d(bpage->in_zip_hash = FALSE);
HASH_DELETE(buf_page_t, hash, buf_pool->zip_hash, fold, bpage);
+ mutex_exit(&zip_hash_mutex);
+
ut_d(memset(buf, 0, UNIV_PAGE_SIZE));
UNIV_MEM_INVALID(buf, UNIV_PAGE_SIZE);
block = (buf_block_t*) bpage;
mutex_enter(&block->mutex);
- buf_LRU_block_free_non_file_page(block);
+ buf_LRU_block_free_non_file_page(block, have_page_hash_mutex);
mutex_exit(&block->mutex);
ut_ad(buf_buddy_n_frames > 0);
@@ -219,7 +225,7 @@ buf_buddy_block_register(
buf_block_t* block) /* in: buffer frame to allocate */
{
const ulint fold = BUF_POOL_ZIP_FOLD(block);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(!mutex_own(&buf_pool_zip_mutex));
buf_block_set_state(block, BUF_BLOCK_MEMORY);
@@ -230,7 +236,10 @@ buf_buddy_block_register(
ut_ad(!block->page.in_page_hash);
ut_ad(!block->page.in_zip_hash);
ut_d(block->page.in_zip_hash = TRUE);
+
+ mutex_enter(&zip_hash_mutex);
HASH_INSERT(buf_page_t, hash, buf_pool->zip_hash, fold, &block->page);
+ mutex_exit(&zip_hash_mutex);
ut_d(buf_buddy_n_frames++);
}
@@ -264,7 +273,7 @@ buf_buddy_alloc_from(
bpage->state = BUF_BLOCK_ZIP_FREE;
#if defined UNIV_DEBUG && !defined UNIV_DEBUG_VALGRIND
/* Valgrind would complain about accessing free memory. */
- UT_LIST_VALIDATE(list, buf_page_t, buf_pool->zip_free[j]);
+ UT_LIST_VALIDATE(zip_list, buf_page_t, buf_pool->zip_free[j]);
#endif /* UNIV_DEBUG && !UNIV_DEBUG_VALGRIND */
buf_buddy_add_to_free(bpage, j);
}
@@ -284,24 +293,28 @@ buf_buddy_alloc_low(
possibly NULL if lru==NULL */
ulint i, /* in: index of buf_pool->zip_free[],
or BUF_BUDDY_SIZES */
- ibool* lru) /* in: pointer to a variable that will be assigned
+ ibool* lru, /* in: pointer to a variable that will be assigned
TRUE if storage was allocated from the LRU list
and buf_pool_mutex was temporarily released,
or NULL if the LRU list should not be used */
+ ibool have_page_hash_mutex)
{
buf_block_t* block;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(!mutex_own(&buf_pool_zip_mutex));
if (i < BUF_BUDDY_SIZES) {
/* Try to allocate from the buddy system. */
+ mutex_enter(&zip_free_mutex);
block = buf_buddy_alloc_zip(i);
if (block) {
goto func_exit;
}
+
+ mutex_exit(&zip_free_mutex);
}
/* Try allocating from the buf_pool->free list. */
@@ -318,18 +331,29 @@ buf_buddy_alloc_low(
}
/* Try replacing an uncompressed page in the buffer pool. */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ if (have_page_hash_mutex) {
+ rw_lock_x_unlock(&page_hash_latch);
+ }
block = buf_LRU_get_free_block(0);
*lru = TRUE;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ if (have_page_hash_mutex) {
+ rw_lock_x_lock(&page_hash_latch);
+ }
alloc_big:
buf_buddy_block_register(block);
+ mutex_enter(&zip_free_mutex);
block = buf_buddy_alloc_from(block->frame, i, BUF_BUDDY_SIZES);
func_exit:
buf_buddy_stat[i].used++;
+ mutex_exit(&zip_free_mutex);
+
return(block);
}
@@ -345,7 +369,10 @@ buf_buddy_relocate_block(
{
buf_page_t* b;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX));
+#endif
switch (buf_page_get_state(bpage)) {
case BUF_BLOCK_ZIP_FREE:
@@ -354,7 +381,7 @@ buf_buddy_relocate_block(
case BUF_BLOCK_FILE_PAGE:
case BUF_BLOCK_MEMORY:
case BUF_BLOCK_REMOVE_HASH:
- ut_error;
+ /* ut_error; */ /* optimistic */
case BUF_BLOCK_ZIP_DIRTY:
/* Cannot relocate dirty pages. */
return(FALSE);
@@ -364,9 +391,17 @@ buf_buddy_relocate_block(
}
mutex_enter(&buf_pool_zip_mutex);
+ mutex_enter(&zip_free_mutex);
if (!buf_page_can_relocate(bpage)) {
mutex_exit(&buf_pool_zip_mutex);
+ mutex_exit(&zip_free_mutex);
+ return(FALSE);
+ }
+
+ if (bpage != buf_page_hash_get(bpage->space, bpage->offset)) {
+ mutex_exit(&buf_pool_zip_mutex);
+ mutex_exit(&zip_free_mutex);
return(FALSE);
}
@@ -374,16 +409,19 @@ buf_buddy_relocate_block(
ut_d(bpage->state = BUF_BLOCK_ZIP_FREE);
/* relocate buf_pool->zip_clean */
- b = UT_LIST_GET_PREV(list, dpage);
- UT_LIST_REMOVE(list, buf_pool->zip_clean, dpage);
+ mutex_enter(&flush_list_mutex);
+ b = UT_LIST_GET_PREV(zip_list, dpage);
+ UT_LIST_REMOVE(zip_list, buf_pool->zip_clean, dpage);
if (b) {
- UT_LIST_INSERT_AFTER(list, buf_pool->zip_clean, b, dpage);
+ UT_LIST_INSERT_AFTER(zip_list, buf_pool->zip_clean, b, dpage);
} else {
- UT_LIST_ADD_FIRST(list, buf_pool->zip_clean, dpage);
+ UT_LIST_ADD_FIRST(zip_list, buf_pool->zip_clean, dpage);
}
+ mutex_exit(&flush_list_mutex);
mutex_exit(&buf_pool_zip_mutex);
+ mutex_exit(&zip_free_mutex);
return(TRUE);
}
@@ -396,13 +434,15 @@ buf_buddy_relocate(
/* out: TRUE if relocated */
void* src, /* in: block to relocate */
void* dst, /* in: free block to relocate to */
- ulint i) /* in: index of buf_pool->zip_free[] */
+ ulint i, /* in: index of buf_pool->zip_free[] */
+ ibool have_page_hash_mutex)
{
buf_page_t* bpage;
const ulint size = BUF_BUDDY_LOW << i;
ullint usec = ut_time_us(NULL);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&zip_free_mutex));
ut_ad(!mutex_own(&buf_pool_zip_mutex));
ut_ad(!ut_align_offset(src, size));
ut_ad(!ut_align_offset(dst, size));
@@ -421,9 +461,16 @@ buf_buddy_relocate(
actually is a properly initialized buf_page_t object. */
if (size >= PAGE_ZIP_MIN_SIZE) {
+ if (!have_page_hash_mutex)
+ mutex_exit(&zip_free_mutex);
+
/* This is a compressed page. */
mutex_t* mutex;
+ if (!have_page_hash_mutex) {
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
+ }
/* The src block may be split into smaller blocks,
some of which may be free. Thus, the
mach_read_from_4() calls below may attempt to read
@@ -444,6 +491,11 @@ buf_buddy_relocate(
added to buf_pool->page_hash yet. Obviously,
it cannot be relocated. */
+ if (!have_page_hash_mutex) {
+ mutex_enter(&zip_free_mutex);
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
return(FALSE);
}
@@ -453,16 +505,32 @@ buf_buddy_relocate(
For the sake of simplicity, give up. */
ut_ad(page_zip_get_size(&bpage->zip) < size);
+ if (!have_page_hash_mutex) {
+ mutex_enter(&zip_free_mutex);
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
return(FALSE);
}
+ /* To keep latch order */
+ if (have_page_hash_mutex)
+ mutex_exit(&zip_free_mutex);
+
/* The block must have been allocated, but it may
contain uninitialized data. */
UNIV_MEM_ASSERT_W(src, size);
mutex = buf_page_get_mutex(bpage);
+retry_lock:
mutex_enter(mutex);
+ if (mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(mutex);
+ mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
+ mutex_enter(&zip_free_mutex);
if (buf_page_can_relocate(bpage)) {
/* Relocate the compressed page. */
@@ -479,17 +547,48 @@ success:
buddy_stat->relocated_usec
+= ut_time_us(NULL) - usec;
}
+
+ if (!have_page_hash_mutex) {
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
return(TRUE);
}
+ if (!have_page_hash_mutex) {
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
+
mutex_exit(mutex);
} else if (i == buf_buddy_get_slot(sizeof(buf_page_t))) {
/* This must be a buf_page_t object. */
UNIV_MEM_ASSERT_RW(src, size);
+
+ mutex_exit(&zip_free_mutex);
+
+ if (!have_page_hash_mutex) {
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
+ }
+
if (buf_buddy_relocate_block(src, dst)) {
+ mutex_enter(&zip_free_mutex);
+
+ if (!have_page_hash_mutex) {
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
goto success;
}
+
+ mutex_enter(&zip_free_mutex);
+
+ if (!have_page_hash_mutex) {
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
}
return(FALSE);
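A note on the "To keep latch order" release of `zip_free_mutex` above: the patch assigns each new latch a sync level and only ever acquires latches in descending order (e.g. `LRU_list_mutex` before `page_hash_latch` before a block mutex before `zip_free_mutex`), which is why `zip_free_mutex` must be dropped before the higher-level latches are taken. The following is an illustrative sketch of that discipline, not XtraDB code; the level values are made up (the real ones live in sync0sync.h):

```c
#include <assert.h>

/* Hypothetical sync levels, higher = acquired earlier. */
enum {
	SYNC_BUF_LRU_LIST  = 50,
	SYNC_BUF_PAGE_HASH = 40,
	SYNC_BUF_BLOCK     = 30,
	SYNC_BUF_ZIP_FREE  = 20
};

#define MAX_HELD 8

struct latch_stack {
	int	level[MAX_HELD];	/* levels of latches currently held */
	int	n;			/* number of latches held */
};

/* Record an acquisition; return 1 if it respects the order (the new
latch's level is below every latch already held), 0 if it would
violate the order. */
static int latch_acquire(struct latch_stack *s, int level)
{
	if (s->n > 0 && level >= s->level[s->n - 1]) {
		return 0;	/* would deadlock under the discipline */
	}
	s->level[s->n++] = level;
	return 1;
}

/* Release the most recently acquired latch. */
static void latch_release(struct latch_stack *s)
{
	s->n--;
}
```

Under this rule, a thread holding `zip_free_mutex` cannot take `LRU_list_mutex`; it must release the lower-level latch first, exactly as `buf_buddy_relocate()` does above.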
@@ -503,12 +602,14 @@ buf_buddy_free_low(
/*===============*/
void* buf, /* in: block to be freed, must not be
pointed to by the buffer pool */
- ulint i) /* in: index of buf_pool->zip_free[] */
+ ulint i, /* in: index of buf_pool->zip_free[] */
+ ibool have_page_hash_mutex)
{
buf_page_t* bpage;
buf_page_t* buddy;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&zip_free_mutex));
ut_ad(!mutex_own(&buf_pool_zip_mutex));
ut_ad(i <= BUF_BUDDY_SIZES);
ut_ad(buf_buddy_stat[i].used > 0);
@@ -519,7 +620,9 @@ recombine:
ut_d(((buf_page_t*) buf)->state = BUF_BLOCK_ZIP_FREE);
if (i == BUF_BUDDY_SIZES) {
- buf_buddy_block_free(buf);
+ mutex_exit(&zip_free_mutex);
+ buf_buddy_block_free(buf, have_page_hash_mutex);
+ mutex_enter(&zip_free_mutex);
return;
}
@@ -564,7 +667,7 @@ buddy_free2:
ut_a(bpage != buf);
{
- buf_page_t* next = UT_LIST_GET_NEXT(list, bpage);
+ buf_page_t* next = UT_LIST_GET_NEXT(zip_list, bpage);
UNIV_MEM_ASSERT_AND_FREE(bpage, BUF_BUDDY_LOW << i);
bpage = next;
}
@@ -573,11 +676,11 @@ buddy_free2:
#ifndef UNIV_DEBUG_VALGRIND
buddy_nonfree:
/* Valgrind would complain about accessing free memory. */
- ut_d(UT_LIST_VALIDATE(list, buf_page_t, buf_pool->zip_free[i]));
+ ut_d(UT_LIST_VALIDATE(zip_list, buf_page_t, buf_pool->zip_free[i]));
#endif /* UNIV_DEBUG_VALGRIND */
/* The buddy is not free. Is there a free block of this size? */
- bpage = UT_LIST_GET_FIRST(buf_pool->zip_free[i]);
+ bpage = UT_LIST_GET_LAST(buf_pool->zip_free[i]);
if (bpage) {
/* Remove the block from the free list, because a successful
@@ -587,7 +690,7 @@ buddy_nonfree:
buf_buddy_remove_from_free(bpage, i);
/* Try to relocate the buddy of buf to the free block. */
- if (buf_buddy_relocate(buddy, bpage, i)) {
+ if (buf_buddy_relocate(buddy, bpage, i, have_page_hash_mutex)) {
ut_d(buddy->state = BUF_BLOCK_ZIP_FREE);
goto buddy_free2;
@@ -608,14 +711,14 @@ buddy_nonfree:
(Parts of the buddy can be free in
buf_pool->zip_free[j] with j < i.)*/
for (b = UT_LIST_GET_FIRST(buf_pool->zip_free[i]);
- b; b = UT_LIST_GET_NEXT(list, b)) {
+ b; b = UT_LIST_GET_NEXT(zip_list, b)) {
ut_a(b != buddy);
}
}
#endif /* UNIV_DEBUG && !UNIV_DEBUG_VALGRIND */
- if (buf_buddy_relocate(buddy, buf, i)) {
+ if (buf_buddy_relocate(buddy, buf, i, have_page_hash_mutex)) {
buf = bpage;
UNIV_MEM_VALID(bpage, BUF_BUDDY_LOW << i);
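The `retry_lock:` loops this patch adds (in `buf_buddy_relocate()` above, and again in buf0buf.c below) all follow one idiom: `buf_page_get_mutex(bpage)` can change while the caller blocks on the old mutex, because the control block may be relocated concurrently, so the caller locks, re-checks, and retries until the mutex it holds is still the page's current one. A minimal single-threaded sketch of the idiom, using pthreads and hypothetical names rather than InnoDB's `mutex_t`:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct page {
	pthread_mutex_t	*mutex;	/* may be repointed by relocation */
};

/* Lock whatever mutex currently protects the page. If the page was
relocated while we waited, the mutex pointer no longer matches: drop
the stale lock and retry with the new one. */
static pthread_mutex_t *page_lock(struct page *pg)
{
	pthread_mutex_t	*m = pg->mutex;

	for (;;) {
		pthread_mutex_lock(m);
		if (m == pg->mutex) {
			return m;	/* still the right mutex */
		}
		pthread_mutex_unlock(m);	/* page moved: retry */
		m = pg->mutex;
	}
}

/* Demo: lock via page_lock() and report whether the latch we ended up
holding is the page's current mutex (1 = yes). */
static int page_lock_demo(void)
{
	static pthread_mutex_t	m1 = PTHREAD_MUTEX_INITIALIZER;
	struct page		pg = { &m1 };
	pthread_mutex_t		*held = page_lock(&pg);
	int			ok = (held == pg.mutex);

	pthread_mutex_unlock(held);
	return ok;
}
```

In the real patch the re-check is cheap and races are resolved by the caller revalidating the page state after the lock is finally held.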
=== modified file 'storage/xtradb/buf/buf0buf.c'
--- a/storage/xtradb/buf/buf0buf.c 2009-05-04 04:32:30 +0000
+++ b/storage/xtradb/buf/buf0buf.c 2009-06-25 01:43:25 +0000
@@ -244,6 +244,12 @@ UNIV_INTERN buf_pool_t* buf_pool = NULL;
/* mutex protecting the buffer pool struct and control blocks, except the
read-write lock in them */
UNIV_INTERN mutex_t buf_pool_mutex;
+UNIV_INTERN mutex_t LRU_list_mutex;
+UNIV_INTERN mutex_t flush_list_mutex;
+UNIV_INTERN rw_lock_t page_hash_latch;
+UNIV_INTERN mutex_t free_list_mutex;
+UNIV_INTERN mutex_t zip_free_mutex;
+UNIV_INTERN mutex_t zip_hash_mutex;
/* mutex protecting the control blocks of compressed-only pages
(of type buf_page_t, not buf_block_t) */
UNIV_INTERN mutex_t buf_pool_zip_mutex;
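The six latches declared above replace the single pool-wide `buf_pool_mutex` for their respective structures. A hedged sketch of the core idea, with hypothetical names and pthreads (this is not XtraDB code): each list gets its own mutex, so an operation on one list no longer serializes against operations on another.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative pool whose two lists have independent locks, mirroring
the patch's split of buf_pool_mutex into LRU_list_mutex,
free_list_mutex, and friends. */
struct split_pool {
	pthread_mutex_t	lru_mutex;	/* protects lru_len only */
	pthread_mutex_t	free_mutex;	/* protects free_len only */
	int		lru_len;
	int		free_len;
};

static void split_pool_init(struct split_pool *p)
{
	pthread_mutex_init(&p->lru_mutex, NULL);
	pthread_mutex_init(&p->free_mutex, NULL);
	p->lru_len = 0;
	p->free_len = 0;
}

/* Each operation takes only the latch for the list it touches and
returns the new list length. */
static int lru_add(struct split_pool *p)
{
	pthread_mutex_lock(&p->lru_mutex);
	int	len = ++p->lru_len;
	pthread_mutex_unlock(&p->lru_mutex);
	return len;
}

static int free_add(struct split_pool *p)
{
	pthread_mutex_lock(&p->free_mutex);
	int	len = ++p->free_len;
	pthread_mutex_unlock(&p->free_mutex);
	return len;
}
```

The cost of the split, visible throughout this patch, is that code paths touching several structures must now acquire several latches in a fixed order instead of one.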
@@ -664,9 +670,9 @@ buf_block_init(
block->page.in_zip_hash = FALSE;
block->page.in_flush_list = FALSE;
block->page.in_free_list = FALSE;
- block->in_unzip_LRU_list = FALSE;
#endif /* UNIV_DEBUG */
block->page.in_LRU_list = FALSE;
+ block->in_unzip_LRU_list = FALSE;
#if defined UNIV_AHI_DEBUG || defined UNIV_DEBUG
block->n_pointers = 0;
#endif /* UNIV_AHI_DEBUG || UNIV_DEBUG */
@@ -751,8 +757,10 @@ buf_chunk_init(
memset(block->frame, '\0', UNIV_PAGE_SIZE);
#endif
/* Add the block to the free list */
- UT_LIST_ADD_LAST(list, buf_pool->free, (&block->page));
+ mutex_enter(&free_list_mutex);
+ UT_LIST_ADD_LAST(free, buf_pool->free, (&block->page));
ut_d(block->page.in_free_list = TRUE);
+ mutex_exit(&free_list_mutex);
block++;
frame += UNIV_PAGE_SIZE;
@@ -778,7 +786,7 @@ buf_chunk_contains_zip(
ulint i;
ut_ad(buf_pool);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
block = chunk->blocks;
@@ -832,7 +840,7 @@ buf_chunk_not_freed(
ulint i;
ut_ad(buf_pool);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own()); /*optimistic...*/
block = chunk->blocks;
@@ -865,7 +873,7 @@ buf_chunk_all_free(
ulint i;
ut_ad(buf_pool);
- ut_ad(buf_pool_mutex_own());
+ ut_ad(buf_pool_mutex_own()); /* all of the new mutexes are still needed here */
block = chunk->blocks;
@@ -891,7 +899,7 @@ buf_chunk_free(
buf_block_t* block;
const buf_block_t* block_end;
- ut_ad(buf_pool_mutex_own());
+ ut_ad(buf_pool_mutex_own()); /* all of the new mutexes are still needed here */
block_end = chunk->blocks + chunk->size;
@@ -903,8 +911,10 @@ buf_chunk_free(
ut_ad(!block->in_unzip_LRU_list);
ut_ad(!block->page.in_flush_list);
/* Remove the block from the free list. */
+ mutex_enter(&free_list_mutex);
ut_ad(block->page.in_free_list);
- UT_LIST_REMOVE(list, buf_pool->free, (&block->page));
+ UT_LIST_REMOVE(free, buf_pool->free, (&block->page));
+ mutex_exit(&free_list_mutex);
/* Free the latches. */
mutex_free(&block->mutex);
@@ -935,8 +945,17 @@ buf_pool_init(void)
/* 1. Initialize general fields
------------------------------- */
mutex_create(&buf_pool_mutex, SYNC_BUF_POOL);
+ mutex_create(&LRU_list_mutex, SYNC_BUF_LRU_LIST);
+ mutex_create(&flush_list_mutex, SYNC_BUF_FLUSH_LIST);
+ rw_lock_create(&page_hash_latch, SYNC_BUF_PAGE_HASH);
+ mutex_create(&free_list_mutex, SYNC_BUF_FREE_LIST);
+ mutex_create(&zip_free_mutex, SYNC_BUF_ZIP_FREE);
+ mutex_create(&zip_hash_mutex, SYNC_BUF_ZIP_HASH);
+
mutex_create(&buf_pool_zip_mutex, SYNC_BUF_BLOCK);
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
buf_pool_mutex_enter();
buf_pool->n_chunks = 1;
@@ -973,6 +992,8 @@ buf_pool_init(void)
--------------------------- */
/* All fields are initialized by mem_zalloc(). */
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
buf_pool_mutex_exit();
btr_search_sys_create(buf_pool->curr_size
@@ -1105,7 +1126,11 @@ buf_relocate(
buf_page_t* b;
ulint fold;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX));
+#endif
ut_ad(mutex_own(buf_page_get_mutex(bpage)));
ut_a(buf_page_get_io_fix(bpage) == BUF_IO_NONE);
ut_a(bpage->buf_fix_count == 0);
@@ -1186,7 +1211,8 @@ buf_pool_shrink(
try_again:
btr_search_disable(); /* Empty the adaptive hash index again */
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
shrink_again:
if (buf_pool->n_chunks <= 1) {
@@ -1257,7 +1283,7 @@ shrink_again:
buf_LRU_make_block_old(&block->page);
dirty++;
- } else if (buf_LRU_free_block(&block->page, TRUE, NULL)
+ } else if (buf_LRU_free_block(&block->page, TRUE, NULL, FALSE)
!= BUF_LRU_FREED) {
nonfree++;
}
@@ -1265,7 +1291,8 @@ shrink_again:
mutex_exit(&block->mutex);
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
/* Request for a flush of the chunk if it helps.
Do not flush if there are non-free blocks, since
@@ -1314,7 +1341,8 @@ shrink_again:
func_done:
srv_buf_pool_old_size = srv_buf_pool_size;
func_exit:
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
btr_search_enable();
}
@@ -1332,7 +1360,11 @@ buf_pool_page_hash_rebuild(void)
hash_table_t* zip_hash;
buf_page_t* b;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
+ mutex_enter(&flush_list_mutex);
+
/* Free, create, and populate the hash table. */
hash_table_free(buf_pool->page_hash);
@@ -1374,7 +1406,7 @@ buf_pool_page_hash_rebuild(void)
in buf_pool->flush_list. */
for (b = UT_LIST_GET_FIRST(buf_pool->zip_clean); b;
- b = UT_LIST_GET_NEXT(list, b)) {
+ b = UT_LIST_GET_NEXT(zip_list, b)) {
ut_a(buf_page_get_state(b) == BUF_BLOCK_ZIP_PAGE);
ut_ad(!b->in_flush_list);
ut_ad(b->in_LRU_list);
@@ -1386,7 +1418,7 @@ buf_pool_page_hash_rebuild(void)
}
for (b = UT_LIST_GET_FIRST(buf_pool->flush_list); b;
- b = UT_LIST_GET_NEXT(list, b)) {
+ b = UT_LIST_GET_NEXT(flush_list, b)) {
ut_ad(b->in_flush_list);
ut_ad(b->in_LRU_list);
ut_ad(b->in_page_hash);
@@ -1412,7 +1444,10 @@ buf_pool_page_hash_rebuild(void)
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ mutex_exit(&flush_list_mutex);
}
/************************************************************************
@@ -1422,17 +1457,20 @@ void
buf_pool_resize(void)
/*=================*/
{
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
if (srv_buf_pool_old_size == srv_buf_pool_size) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
return;
}
if (srv_buf_pool_curr_size + 1048576 > srv_buf_pool_size) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
/* Disable adaptive hash indexes and empty the index
in order to free up memory in the buffer pool chunks. */
@@ -1466,7 +1504,8 @@ buf_pool_resize(void)
}
srv_buf_pool_old_size = srv_buf_pool_size;
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
}
buf_pool_page_hash_rebuild();
@@ -1488,12 +1527,14 @@ buf_block_make_young(
if (buf_page_peek_if_too_old(bpage)) {
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
/* There has been freeing activity in the LRU list:
best to move to the head of the LRU list */
buf_LRU_make_block_young(bpage);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
}
}
@@ -1507,13 +1548,15 @@ buf_page_make_young(
/*================*/
buf_page_t* bpage) /* in: buffer block of a file page */
{
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
ut_a(buf_page_in_file(bpage));
buf_LRU_make_block_young(bpage);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
}
/************************************************************************
@@ -1528,7 +1571,8 @@ buf_reset_check_index_page_at_flush(
{
buf_block_t* block;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
block = (buf_block_t*) buf_page_hash_get(space, offset);
@@ -1536,7 +1580,8 @@ buf_reset_check_index_page_at_flush(
block->check_index_page_at_flush = FALSE;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
}
/************************************************************************
@@ -1555,7 +1600,8 @@ buf_page_peek_if_search_hashed(
buf_block_t* block;
ibool is_hashed;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
block = (buf_block_t*) buf_page_hash_get(space, offset);
@@ -1565,7 +1611,8 @@ buf_page_peek_if_search_hashed(
is_hashed = block->is_hashed;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(is_hashed);
}
@@ -1587,7 +1634,8 @@ buf_page_set_file_page_was_freed(
{
buf_page_t* bpage;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
bpage = buf_page_hash_get(space, offset);
@@ -1595,7 +1643,8 @@ buf_page_set_file_page_was_freed(
bpage->file_page_was_freed = TRUE;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(bpage);
}
@@ -1616,7 +1665,8 @@ buf_page_reset_file_page_was_freed(
{
buf_page_t* bpage;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
bpage = buf_page_hash_get(space, offset);
@@ -1624,7 +1674,8 @@ buf_page_reset_file_page_was_freed(
bpage->file_page_was_freed = FALSE;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(bpage);
}
@@ -1657,8 +1708,9 @@ buf_page_get_zip(
buf_pool->n_page_gets++;
for (;;) {
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
lookup:
+ rw_lock_s_lock(&page_hash_latch);
bpage = buf_page_hash_get(space, offset);
if (bpage) {
break;
@@ -1666,7 +1718,8 @@ lookup:
/* Page not in buf_pool: needs to be read from file */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
buf_read_page(space, zip_size, offset);
@@ -1677,12 +1730,21 @@ lookup:
if (UNIV_UNLIKELY(!bpage->zip.data)) {
/* There is no compressed page. */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(NULL);
}
block_mutex = buf_page_get_mutex(bpage);
+retry_lock:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
+
+ rw_lock_s_unlock(&page_hash_latch);
switch (buf_page_get_state(bpage)) {
case BUF_BLOCK_NOT_USED:
@@ -1698,7 +1760,7 @@ lookup:
break;
case BUF_BLOCK_FILE_PAGE:
/* Discard the uncompressed page frame if possible. */
- if (buf_LRU_free_block(bpage, FALSE, NULL)
+ if (buf_LRU_free_block(bpage, FALSE, NULL, FALSE)
== BUF_LRU_FREED) {
mutex_exit(block_mutex);
@@ -1712,7 +1774,7 @@ lookup:
must_read = buf_page_get_io_fix(bpage) == BUF_IO_READ;
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
buf_page_set_accessed(bpage, TRUE);
@@ -1943,7 +2005,7 @@ buf_block_is_uncompressed(
const buf_chunk_t* chunk = buf_pool->chunks;
const buf_chunk_t* const echunk = chunk + buf_pool->n_chunks;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
if (UNIV_UNLIKELY((((ulint) block) % sizeof *block) != 0)) {
/* The pointer should be aligned. */
@@ -1986,6 +2048,7 @@ buf_page_get_gen(
ibool accessed;
ulint fix_type;
ibool must_read;
+ mutex_t* block_mutex;
ut_ad(mtr);
ut_ad((rw_latch == RW_S_LATCH)
@@ -2001,9 +2064,18 @@ buf_page_get_gen(
buf_pool->n_page_gets++;
loop:
block = guess;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
if (block) {
+ block_mutex = buf_page_get_mutex((buf_page_t*)block);
+retry_lock_1:
+ mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex((buf_page_t*)block)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex((buf_page_t*)block);
+ goto retry_lock_1;
+ }
+
/* If the guess is a compressed page descriptor that
has been allocated by buf_buddy_alloc(), it may have
been invalidated by buf_buddy_relocate(). In that
@@ -2017,6 +2089,8 @@ loop:
|| space != block->page.space
|| buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE) {
+ mutex_exit(block_mutex);
+
block = guess = NULL;
} else {
ut_ad(!block->page.in_zip_hash);
@@ -2025,14 +2099,26 @@ loop:
}
if (block == NULL) {
+ rw_lock_s_lock(&page_hash_latch);
block = (buf_block_t*) buf_page_hash_get(space, offset);
+ if (block) {
+ block_mutex = buf_page_get_mutex((buf_page_t*)block);
+retry_lock_2:
+ mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex((buf_page_t*)block)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex((buf_page_t*)block);
+ goto retry_lock_2;
+ }
+ }
+ rw_lock_s_unlock(&page_hash_latch);
}
loop2:
if (block == NULL) {
/* Page not in buf_pool: needs to be read from file */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
if (mode == BUF_GET_IF_IN_POOL) {
@@ -2053,7 +2139,8 @@ loop2:
if (must_read && mode == BUF_GET_IF_IN_POOL) {
/* The page is only being read to buffer */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(block_mutex);
return(NULL);
}
@@ -2063,10 +2150,16 @@ loop2:
ibool success;
case BUF_BLOCK_FILE_PAGE:
+ if (block_mutex == &buf_pool_zip_mutex) {
+ /* wrong mutex: the descriptor was relocated, so retry */
+ mutex_exit(block_mutex);
+ goto loop;
+ }
break;
case BUF_BLOCK_ZIP_PAGE:
case BUF_BLOCK_ZIP_DIRTY:
+ ut_ad(block_mutex == &buf_pool_zip_mutex);
bpage = &block->page;
if (bpage->buf_fix_count
@@ -2077,20 +2170,25 @@ loop2:
wait_until_unfixed:
/* The block is buffer-fixed or I/O-fixed.
Try again later. */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(block_mutex);
os_thread_sleep(WAIT_FOR_READ);
goto loop;
}
/* Allocate an uncompressed page. */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(block_mutex);
block = buf_LRU_get_free_block(0);
ut_a(block);
+ block_mutex = &block->mutex;
- buf_pool_mutex_enter();
- mutex_enter(&block->mutex);
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
+ mutex_enter(block_mutex);
{
buf_page_t* hash_bpage
@@ -2101,35 +2199,55 @@ wait_until_unfixed:
while buf_pool_mutex was released.
Free the block that was allocated. */
- buf_LRU_block_free_non_file_page(block);
- mutex_exit(&block->mutex);
+ buf_LRU_block_free_non_file_page(block, TRUE);
+ mutex_exit(block_mutex);
block = (buf_block_t*) hash_bpage;
+ if (block) {
+ block_mutex = buf_page_get_mutex((buf_page_t*)block);
+retry_lock_3:
+ mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex((buf_page_t*)block)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex((buf_page_t*)block);
+ goto retry_lock_3;
+ }
+ }
+ rw_lock_x_unlock(&page_hash_latch);
+ mutex_exit(&LRU_list_mutex);
goto loop2;
}
}
+ mutex_enter(&buf_pool_zip_mutex);
+
if (UNIV_UNLIKELY
(bpage->buf_fix_count
|| buf_page_get_io_fix(bpage) != BUF_IO_NONE)) {
+ mutex_exit(&buf_pool_zip_mutex);
/* The block was buffer-fixed or I/O-fixed
while buf_pool_mutex was not held by this thread.
Free the block that was allocated and try again.
This should be extremely unlikely. */
- buf_LRU_block_free_non_file_page(block);
- mutex_exit(&block->mutex);
+ buf_LRU_block_free_non_file_page(block, TRUE);
+ //mutex_exit(&block->mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ mutex_exit(&LRU_list_mutex);
goto wait_until_unfixed;
}
/* Move the compressed page from bpage to block,
and uncompress it. */
- mutex_enter(&buf_pool_zip_mutex);
+ mutex_enter(&flush_list_mutex);
buf_relocate(bpage, &block->page);
+
+ rw_lock_x_unlock(&page_hash_latch);
+
buf_block_init_low(block);
block->lock_hash_val = lock_rec_hash(space, offset);
@@ -2138,29 +2256,31 @@ wait_until_unfixed:
if (buf_page_get_state(&block->page)
== BUF_BLOCK_ZIP_PAGE) {
- UT_LIST_REMOVE(list, buf_pool->zip_clean,
+ UT_LIST_REMOVE(zip_list, buf_pool->zip_clean,
&block->page);
ut_ad(!block->page.in_flush_list);
} else {
/* Relocate buf_pool->flush_list. */
buf_page_t* b;
- b = UT_LIST_GET_PREV(list, &block->page);
+ b = UT_LIST_GET_PREV(flush_list, &block->page);
ut_ad(block->page.in_flush_list);
- UT_LIST_REMOVE(list, buf_pool->flush_list,
+ UT_LIST_REMOVE(flush_list, buf_pool->flush_list,
&block->page);
if (b) {
UT_LIST_INSERT_AFTER(
- list, buf_pool->flush_list, b,
+ flush_list, buf_pool->flush_list, b,
&block->page);
} else {
UT_LIST_ADD_FIRST(
- list, buf_pool->flush_list,
+ flush_list, buf_pool->flush_list,
&block->page);
}
}
+ mutex_exit(&flush_list_mutex);
+
/* Buffer-fix, I/O-fix, and X-latch the block
for the duration of the decompression.
Also add the block to the unzip_LRU list. */
@@ -2169,16 +2289,22 @@ wait_until_unfixed:
/* Insert at the front of unzip_LRU list */
buf_unzip_LRU_add_block(block, FALSE);
+ mutex_exit(&LRU_list_mutex);
+
block->page.buf_fix_count = 1;
buf_block_set_io_fix(block, BUF_IO_READ);
+
+ mutex_enter(&buf_pool_mutex);
buf_pool->n_pend_unzip++;
+ mutex_exit(&buf_pool_mutex);
+
rw_lock_x_lock(&block->lock);
- mutex_exit(&block->mutex);
+ mutex_exit(block_mutex);
mutex_exit(&buf_pool_zip_mutex);
- buf_buddy_free(bpage, sizeof *bpage);
+ buf_buddy_free(bpage, sizeof *bpage, FALSE);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
/* Decompress the page and apply buffered operations
while not holding buf_pool_mutex or block->mutex. */
@@ -2190,17 +2316,21 @@ wait_until_unfixed:
}
/* Unfix and unlatch the block. */
- buf_pool_mutex_enter();
- mutex_enter(&block->mutex);
+ //buf_pool_mutex_enter();
+ block_mutex = &block->mutex;
+ mutex_enter(block_mutex);
+ mutex_enter(&buf_pool_mutex);
buf_pool->n_pend_unzip--;
+ mutex_exit(&buf_pool_mutex);
block->page.buf_fix_count--;
buf_block_set_io_fix(block, BUF_IO_NONE);
- mutex_exit(&block->mutex);
+ //mutex_exit(&block->mutex);
rw_lock_x_unlock(&block->lock);
if (UNIV_UNLIKELY(!success)) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(block_mutex);
return(NULL);
}
@@ -2217,11 +2347,11 @@ wait_until_unfixed:
ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
- mutex_enter(&block->mutex);
+ //mutex_enter(&block->mutex);
UNIV_MEM_ASSERT_RW(&block->page, sizeof block->page);
buf_block_buf_fix_inc(block, file, line);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
/* Check if this is the first access to the page */
@@ -2229,7 +2359,7 @@ wait_until_unfixed:
buf_page_set_accessed(&block->page, TRUE);
- mutex_exit(&block->mutex);
+ mutex_exit(block_mutex);
buf_block_make_young(&block->page);
@@ -2515,16 +2645,19 @@ buf_page_try_get_func(
ibool success;
ulint fix_type;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
block = buf_block_hash_get(space_id, page_no);
if (!block) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(NULL);
}
mutex_enter(&block->mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
ut_a(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
@@ -2644,7 +2777,10 @@ buf_page_init(
{
buf_page_t* hash_page;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX));
+#endif
ut_ad(mutex_own(&(block->mutex)));
ut_a(buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE);
@@ -2677,7 +2813,8 @@ buf_page_init(
(const void*) hash_page, (const void*) block);
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
mutex_exit(&block->mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_x_unlock(&page_hash_latch);
buf_print();
buf_LRU_print();
buf_validate();
@@ -2756,16 +2893,24 @@ buf_page_init_for_read(
ut_ad(block);
}
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
if (buf_page_hash_get(space, offset)) {
/* The page is already in the buffer pool. */
err_exit:
if (block) {
mutex_enter(&block->mutex);
- buf_LRU_block_free_non_file_page(block);
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ buf_LRU_block_free_non_file_page(block, FALSE);
mutex_exit(&block->mutex);
}
+ else {
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
bpage = NULL;
goto func_exit;
@@ -2785,6 +2930,8 @@ err_exit:
mutex_enter(&block->mutex);
buf_page_init(space, offset, block);
+ rw_lock_x_unlock(&page_hash_latch);
+
/* The block must be put to the LRU list, to the old blocks */
buf_LRU_add_block(bpage, TRUE/* to old blocks */);
@@ -2812,7 +2959,7 @@ err_exit:
been added to buf_pool->LRU and
buf_pool->page_hash. */
mutex_exit(&block->mutex);
- data = buf_buddy_alloc(zip_size, &lru);
+ data = buf_buddy_alloc(zip_size, &lru, FALSE);
mutex_enter(&block->mutex);
block->page.zip.data = data;
@@ -2825,6 +2972,7 @@ err_exit:
buf_unzip_LRU_add_block(block, TRUE);
}
+ mutex_exit(&LRU_list_mutex);
mutex_exit(&block->mutex);
} else {
/* Defer buf_buddy_alloc() until after the block has
@@ -2836,8 +2984,8 @@ err_exit:
control block (bpage), in order to avoid the
invocation of buf_buddy_relocate_block() on
uninitialized data. */
- data = buf_buddy_alloc(zip_size, &lru);
- bpage = buf_buddy_alloc(sizeof *bpage, &lru);
+ data = buf_buddy_alloc(zip_size, &lru, TRUE);
+ bpage = buf_buddy_alloc(sizeof *bpage, &lru, TRUE);
/* If buf_buddy_alloc() allocated storage from the LRU list,
it released and reacquired buf_pool_mutex. Thus, we must
@@ -2846,8 +2994,11 @@ err_exit:
&& UNIV_LIKELY_NULL(buf_page_hash_get(space, offset))) {
/* The block was added by some other thread. */
- buf_buddy_free(bpage, sizeof *bpage);
- buf_buddy_free(data, zip_size);
+ buf_buddy_free(bpage, sizeof *bpage, TRUE);
+ buf_buddy_free(data, zip_size, TRUE);
+
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
bpage = NULL;
goto func_exit;
@@ -2877,18 +3028,26 @@ err_exit:
HASH_INSERT(buf_page_t, hash, buf_pool->page_hash,
buf_page_address_fold(space, offset), bpage);
+ rw_lock_x_unlock(&page_hash_latch);
+
/* The block must be put to the LRU list, to the old blocks */
buf_LRU_add_block(bpage, TRUE/* to old blocks */);
+ mutex_enter(&flush_list_mutex);
buf_LRU_insert_zip_clean(bpage);
+ mutex_exit(&flush_list_mutex);
+
+ mutex_exit(&LRU_list_mutex);
buf_page_set_io_fix(bpage, BUF_IO_READ);
mutex_exit(&buf_pool_zip_mutex);
}
+ mutex_enter(&buf_pool_mutex);
buf_pool->n_pend_reads++;
+ mutex_exit(&buf_pool_mutex);
func_exit:
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
if (mode == BUF_READ_IBUF_PAGES_ONLY) {
@@ -2924,7 +3083,9 @@ buf_page_create(
free_block = buf_LRU_get_free_block(0);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
block = (buf_block_t*) buf_page_hash_get(space, offset);
@@ -2937,7 +3098,9 @@ buf_page_create(
#endif /* UNIV_DEBUG_FILE_ACCESSES */
/* Page can be found in buf_pool */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
buf_block_free(free_block);
@@ -2959,6 +3122,7 @@ buf_page_create(
mutex_enter(&block->mutex);
buf_page_init(space, offset, block);
+ rw_lock_x_unlock(&page_hash_latch);
/* The block must be put to the LRU list */
buf_LRU_add_block(&block->page, FALSE);
@@ -2985,7 +3149,7 @@ buf_page_create(
the reacquisition of buf_pool_mutex. We also must
defer this operation until after the block descriptor
has been added to buf_pool->LRU and buf_pool->page_hash. */
- data = buf_buddy_alloc(zip_size, &lru);
+ data = buf_buddy_alloc(zip_size, &lru, FALSE);
mutex_enter(&block->mutex);
block->page.zip.data = data;
@@ -3001,7 +3165,8 @@ buf_page_create(
rw_lock_x_unlock(&block->lock);
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
mtr_memo_push(mtr, block, MTR_MEMO_BUF_FIX);
@@ -3053,6 +3218,8 @@ buf_page_io_complete(
enum buf_io_fix io_type;
const ibool uncompressed = (buf_page_get_state(bpage)
== BUF_BLOCK_FILE_PAGE);
+ enum buf_flush flush_type;
+ mutex_t* block_mutex;
ut_a(buf_page_in_file(bpage));
@@ -3187,8 +3354,23 @@ corrupt:
}
}
- buf_pool_mutex_enter();
- mutex_enter(buf_page_get_mutex(bpage));
+ //buf_pool_mutex_enter();
+ if (io_type == BUF_IO_WRITE) {
+ flush_type = buf_page_get_flush_type(bpage);
+ /* to keep consistency at buf_LRU_insert_zip_clean() */
+ //if (flush_type == BUF_FLUSH_LRU) { /* optimistic! */
+ mutex_enter(&LRU_list_mutex);
+ //}
+ }
+ block_mutex = buf_page_get_mutex(bpage);
+retry_lock:
+ mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
+ mutex_enter(&buf_pool_mutex);
#ifdef UNIV_IBUF_COUNT_DEBUG
if (io_type == BUF_IO_WRITE || uncompressed) {
@@ -3228,6 +3410,11 @@ corrupt:
buf_flush_write_complete(bpage);
+ /* to keep consistency at buf_LRU_insert_zip_clean() */
+ //if (flush_type == BUF_FLUSH_LRU) { /* optimistic! */
+ mutex_exit(&LRU_list_mutex);
+ //}
+
if (uncompressed) {
rw_lock_s_unlock_gen(&((buf_block_t*) bpage)->lock,
BUF_IO_WRITE);
@@ -3250,8 +3437,9 @@ corrupt:
}
#endif /* UNIV_DEBUG */
- mutex_exit(buf_page_get_mutex(bpage));
- buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
+ mutex_exit(block_mutex);
+ //buf_pool_mutex_exit();
}
/*************************************************************************
@@ -3273,12 +3461,14 @@ buf_pool_invalidate(void)
freed = buf_LRU_search_and_free_block(100);
}
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
ut_ad(UT_LIST_GET_LEN(buf_pool->LRU) == 0);
ut_ad(UT_LIST_GET_LEN(buf_pool->unzip_LRU) == 0);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
}
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
@@ -3302,7 +3492,10 @@ buf_validate(void)
ut_ad(buf_pool);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
+ /* to respect the new latch order, this cannot validate everything correctly... */
chunk = buf_pool->chunks;
@@ -3401,7 +3594,7 @@ buf_validate(void)
/* Check clean compressed-only blocks. */
for (b = UT_LIST_GET_FIRST(buf_pool->zip_clean); b;
- b = UT_LIST_GET_NEXT(list, b)) {
+ b = UT_LIST_GET_NEXT(zip_list, b)) {
ut_a(buf_page_get_state(b) == BUF_BLOCK_ZIP_PAGE);
switch (buf_page_get_io_fix(b)) {
case BUF_IO_NONE:
@@ -3426,8 +3619,9 @@ buf_validate(void)
/* Check dirty compressed-only blocks. */
+ mutex_enter(&flush_list_mutex);
for (b = UT_LIST_GET_FIRST(buf_pool->flush_list); b;
- b = UT_LIST_GET_NEXT(list, b)) {
+ b = UT_LIST_GET_NEXT(flush_list, b)) {
ut_ad(b->in_flush_list);
switch (buf_page_get_state(b)) {
@@ -3472,6 +3666,7 @@ buf_validate(void)
}
ut_a(buf_page_hash_get(b->space, b->offset) == b);
}
+ mutex_exit(&flush_list_mutex);
mutex_exit(&buf_pool_zip_mutex);
@@ -3483,19 +3678,27 @@ buf_validate(void)
}
ut_a(UT_LIST_GET_LEN(buf_pool->LRU) == n_lru);
+ /* because of the latching order against block->mutex, free_list_mutex cannot be acquired beforehand */
+/*
if (UT_LIST_GET_LEN(buf_pool->free) != n_free) {
fprintf(stderr, "Free list len %lu, free blocks %lu\n",
(ulong) UT_LIST_GET_LEN(buf_pool->free),
(ulong) n_free);
ut_error;
}
+*/
+ /* because of the latching order against block->mutex, flush_list_mutex cannot be acquired beforehand */
+/*
ut_a(UT_LIST_GET_LEN(buf_pool->flush_list) == n_flush);
ut_a(buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE] == n_single_flush);
ut_a(buf_pool->n_flush[BUF_FLUSH_LIST] == n_list_flush);
ut_a(buf_pool->n_flush[BUF_FLUSH_LRU] == n_lru_flush);
+*/
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
ut_a(buf_LRU_validate());
ut_a(buf_flush_validate());
@@ -3529,7 +3732,10 @@ buf_print(void)
index_ids = mem_alloc(sizeof(dulint) * size);
counts = mem_alloc(sizeof(ulint) * size);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ mutex_enter(&free_list_mutex);
+ mutex_enter(&flush_list_mutex);
fprintf(stderr,
"buf_pool size %lu\n"
@@ -3592,7 +3798,10 @@ buf_print(void)
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ mutex_exit(&free_list_mutex);
+ mutex_exit(&flush_list_mutex);
for (i = 0; i < n_found; i++) {
index = dict_index_get_if_in_cache(index_ids[i]);
@@ -3630,7 +3839,7 @@ buf_get_latched_pages_number(void)
ulint i;
ulint fixed_pages_number = 0;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
chunk = buf_pool->chunks;
@@ -3664,7 +3873,7 @@ buf_get_latched_pages_number(void)
/* Traverse the lists of clean and dirty compressed-only blocks. */
for (b = UT_LIST_GET_FIRST(buf_pool->zip_clean); b;
- b = UT_LIST_GET_NEXT(list, b)) {
+ b = UT_LIST_GET_NEXT(zip_list, b)) {
ut_a(buf_page_get_state(b) == BUF_BLOCK_ZIP_PAGE);
ut_a(buf_page_get_io_fix(b) != BUF_IO_WRITE);
@@ -3674,8 +3883,9 @@ buf_get_latched_pages_number(void)
}
}
+ mutex_enter(&flush_list_mutex);
for (b = UT_LIST_GET_FIRST(buf_pool->flush_list); b;
- b = UT_LIST_GET_NEXT(list, b)) {
+ b = UT_LIST_GET_NEXT(flush_list, b)) {
ut_ad(b->in_flush_list);
switch (buf_page_get_state(b)) {
@@ -3698,9 +3908,10 @@ buf_get_latched_pages_number(void)
break;
}
}
+ mutex_exit(&flush_list_mutex);
mutex_exit(&buf_pool_zip_mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
return(fixed_pages_number);
}
@@ -3757,7 +3968,11 @@ buf_print_io(
ut_ad(buf_pool);
size = buf_pool->curr_size;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ mutex_enter(&free_list_mutex);
+ mutex_enter(&buf_pool_mutex);
+ mutex_enter(&flush_list_mutex);
fprintf(file,
"Buffer pool size %lu\n"
@@ -3824,7 +4039,11 @@ buf_print_io(
buf_LRU_stat_sum.io, buf_LRU_stat_cur.io,
buf_LRU_stat_sum.unzip, buf_LRU_stat_cur.unzip);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ mutex_exit(&free_list_mutex);
+ mutex_exit(&buf_pool_mutex);
+ mutex_exit(&flush_list_mutex);
}
/**************************************************************************
@@ -3853,7 +4072,7 @@ buf_all_freed(void)
ut_ad(buf_pool);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter(); /* optimistic */
chunk = buf_pool->chunks;
@@ -3870,7 +4089,7 @@ buf_all_freed(void)
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit(); /* optimistic */
return(TRUE);
}
@@ -3886,7 +4105,8 @@ buf_pool_check_no_pending_io(void)
{
ibool ret;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
if (buf_pool->n_pend_reads + buf_pool->n_flush[BUF_FLUSH_LRU]
+ buf_pool->n_flush[BUF_FLUSH_LIST]
@@ -3896,7 +4116,8 @@ buf_pool_check_no_pending_io(void)
ret = TRUE;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
return(ret);
}
@@ -3910,11 +4131,13 @@ buf_get_free_list_len(void)
{
ulint len;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&free_list_mutex);
len = UT_LIST_GET_LEN(buf_pool->free);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&free_list_mutex);
return(len);
}
=== modified file 'storage/xtradb/buf/buf0flu.c'
--- a/storage/xtradb/buf/buf0flu.c 2009-05-04 04:32:30 +0000
+++ b/storage/xtradb/buf/buf0flu.c 2009-06-25 01:43:25 +0000
@@ -61,7 +61,9 @@ buf_flush_insert_into_flush_list(
/*=============================*/
buf_block_t* block) /* in/out: block which is modified */
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&block->mutex));
+ ut_ad(mutex_own(&flush_list_mutex));
ut_ad((UT_LIST_GET_FIRST(buf_pool->flush_list) == NULL)
|| (UT_LIST_GET_FIRST(buf_pool->flush_list)->oldest_modification
<= block->page.oldest_modification));
@@ -72,7 +74,7 @@ buf_flush_insert_into_flush_list(
ut_ad(!block->page.in_zip_hash);
ut_ad(!block->page.in_flush_list);
ut_d(block->page.in_flush_list = TRUE);
- UT_LIST_ADD_FIRST(list, buf_pool->flush_list, &block->page);
+ UT_LIST_ADD_FIRST(flush_list, buf_pool->flush_list, &block->page);
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
ut_a(buf_flush_validate_low());
@@ -92,7 +94,9 @@ buf_flush_insert_sorted_into_flush_list(
buf_page_t* prev_b;
buf_page_t* b;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&block->mutex));
+ ut_ad(mutex_own(&flush_list_mutex));
ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
ut_ad(block->page.in_LRU_list);
@@ -107,13 +111,13 @@ buf_flush_insert_sorted_into_flush_list(
while (b && b->oldest_modification > block->page.oldest_modification) {
ut_ad(b->in_flush_list);
prev_b = b;
- b = UT_LIST_GET_NEXT(list, b);
+ b = UT_LIST_GET_NEXT(flush_list, b);
}
if (prev_b == NULL) {
- UT_LIST_ADD_FIRST(list, buf_pool->flush_list, &block->page);
+ UT_LIST_ADD_FIRST(flush_list, buf_pool->flush_list, &block->page);
} else {
- UT_LIST_INSERT_AFTER(list, buf_pool->flush_list,
+ UT_LIST_INSERT_AFTER(flush_list, buf_pool->flush_list,
prev_b, &block->page);
}
@@ -134,7 +138,7 @@ buf_flush_ready_for_replace(
buf_page_in_file(bpage) and in the LRU list */
{
//ut_ad(buf_pool_mutex_own());
- //ut_ad(mutex_own(buf_page_get_mutex(bpage)));
+ ut_ad(mutex_own(buf_page_get_mutex(bpage)));
//ut_ad(bpage->in_LRU_list); /* optimistic use */
if (UNIV_LIKELY(bpage->in_LRU_list && buf_page_in_file(bpage))) {
@@ -169,12 +173,12 @@ buf_flush_ready_for_flush(
buf_page_in_file(bpage) */
enum buf_flush flush_type)/* in: BUF_FLUSH_LRU or BUF_FLUSH_LIST */
{
- ut_a(buf_page_in_file(bpage));
- ut_ad(buf_pool_mutex_own());
+ //ut_a(buf_page_in_file(bpage));
+ //ut_ad(buf_pool_mutex_own()); /*optimistic...*/
ut_ad(mutex_own(buf_page_get_mutex(bpage)));
ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_LIST);
- if (bpage->oldest_modification != 0
+ if (buf_page_in_file(bpage) && bpage->oldest_modification != 0
&& buf_page_get_io_fix(bpage) == BUF_IO_NONE) {
ut_ad(bpage->in_flush_list);
@@ -203,8 +207,11 @@ buf_flush_remove(
/*=============*/
buf_page_t* bpage) /* in: pointer to the block in question */
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mutex_own(buf_page_get_mutex(bpage)));
+
+ mutex_enter(&flush_list_mutex);
+
ut_ad(bpage->in_flush_list);
ut_d(bpage->in_flush_list = FALSE);
@@ -216,21 +223,23 @@ buf_flush_remove(
case BUF_BLOCK_READY_FOR_USE:
case BUF_BLOCK_MEMORY:
case BUF_BLOCK_REMOVE_HASH:
+ mutex_exit(&flush_list_mutex);
ut_error;
return;
case BUF_BLOCK_ZIP_DIRTY:
buf_page_set_state(bpage, BUF_BLOCK_ZIP_PAGE);
- UT_LIST_REMOVE(list, buf_pool->flush_list, bpage);
+ UT_LIST_REMOVE(flush_list, buf_pool->flush_list, bpage);
buf_LRU_insert_zip_clean(bpage);
break;
case BUF_BLOCK_FILE_PAGE:
- UT_LIST_REMOVE(list, buf_pool->flush_list, bpage);
+ UT_LIST_REMOVE(flush_list, buf_pool->flush_list, bpage);
break;
}
bpage->oldest_modification = 0;
- ut_d(UT_LIST_VALIDATE(list, buf_page_t, buf_pool->flush_list));
+ ut_d(UT_LIST_VALIDATE(flush_list, buf_page_t, buf_pool->flush_list));
+ mutex_exit(&flush_list_mutex);
}
/************************************************************************
@@ -678,7 +687,9 @@ buf_flush_write_block_low(
io_fixed and oldest_modification != 0. Thus, it cannot be
relocated in the buffer pool or removed from flush_list or
LRU_list. */
- ut_ad(!buf_pool_mutex_own());
+ //ut_ad(!buf_pool_mutex_own());
+ ut_ad(!mutex_own(&LRU_list_mutex));
+ ut_ad(!mutex_own(&flush_list_mutex));
ut_ad(!mutex_own(buf_page_get_mutex(bpage)));
ut_ad(buf_page_get_io_fix(bpage) == BUF_IO_WRITE);
ut_ad(bpage->oldest_modification != 0);
@@ -762,12 +773,19 @@ buf_flush_page(
ibool is_uncompressed;
ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_LIST);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX)
+ || rw_lock_own(&page_hash_latch, RW_LOCK_SHARED));
+#endif
ut_ad(buf_page_in_file(bpage));
block_mutex = buf_page_get_mutex(bpage);
ut_ad(mutex_own(block_mutex));
+ mutex_enter(&buf_pool_mutex);
+ rw_lock_s_unlock(&page_hash_latch);
+
ut_ad(buf_flush_ready_for_flush(bpage, flush_type));
buf_page_set_io_fix(bpage, BUF_IO_WRITE);
@@ -798,7 +816,8 @@ buf_flush_page(
}
mutex_exit(block_mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
/* Even though bpage is not protected by any mutex at
this point, it is safe to access bpage, because it is
@@ -835,7 +854,8 @@ buf_flush_page(
immediately. */
mutex_exit(block_mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
break;
default:
@@ -899,7 +919,8 @@ buf_flush_try_neighbors(
high = fil_space_get_size(space);
}
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
for (i = low; i < high; i++) {
@@ -920,7 +941,13 @@ buf_flush_try_neighbors(
|| buf_page_is_old(bpage)) {
mutex_t* block_mutex = buf_page_get_mutex(bpage);
+retry_lock:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
if (buf_flush_ready_for_flush(bpage, flush_type)
&& (i == offset || !bpage->buf_fix_count)) {
@@ -936,14 +963,16 @@ buf_flush_try_neighbors(
ut_ad(!mutex_own(block_mutex));
count++;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
} else {
mutex_exit(block_mutex);
}
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(count);
}
@@ -980,6 +1009,7 @@ buf_flush_batch(
ulint old_page_count;
ulint space;
ulint offset;
+ ulint remaining = 0;
ut_ad((flush_type == BUF_FLUSH_LRU)
|| (flush_type == BUF_FLUSH_LIST));
@@ -987,20 +1017,28 @@ buf_flush_batch(
ut_ad((flush_type != BUF_FLUSH_LIST)
|| sync_thread_levels_empty_gen(TRUE));
#endif /* UNIV_SYNC_DEBUG */
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
if ((buf_pool->n_flush[flush_type] > 0)
|| (buf_pool->init_flush[flush_type] == TRUE)) {
/* There is already a flush batch of the same type running */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
return(ULINT_UNDEFINED);
}
buf_pool->init_flush[flush_type] = TRUE;
+ mutex_exit(&buf_pool_mutex);
+
+ if (flush_type == BUF_FLUSH_LRU) {
+ mutex_enter(&LRU_list_mutex);
+ }
+
for (;;) {
flush_next:
/* If we have flushed enough, leave the loop */
@@ -1017,7 +1055,10 @@ flush_next:
} else {
ut_ad(flush_type == BUF_FLUSH_LIST);
+ mutex_enter(&flush_list_mutex);
+ remaining = UT_LIST_GET_LEN(buf_pool->flush_list);
bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
+ mutex_exit(&flush_list_mutex);
if (!bpage
|| bpage->oldest_modification >= lsn_limit) {
/* We have flushed enough */
@@ -1037,9 +1078,15 @@ flush_next:
mutex_t*block_mutex = buf_page_get_mutex(bpage);
ibool ready;
+retry_lock_1:
ut_a(buf_page_in_file(bpage));
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock_1;
+ }
ready = buf_flush_ready_for_flush(bpage, flush_type);
mutex_exit(block_mutex);
@@ -1047,7 +1094,10 @@ flush_next:
space = buf_page_get_space(bpage);
offset = buf_page_get_page_no(bpage);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ if (flush_type == BUF_FLUSH_LRU) {
+ mutex_exit(&LRU_list_mutex);
+ }
old_page_count = page_count;
@@ -1057,10 +1107,17 @@ flush_next:
space, offset, flush_type);
} else {
/* Try to flush the page only */
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
mutex_t* block_mutex = buf_page_get_mutex(bpage);
+retry_lock_2:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock_2;
+ }
buf_page_t* bpage_tmp = buf_page_hash_get(space, offset);
if (bpage_tmp) {
@@ -1073,7 +1130,10 @@ flush_next:
flush_type, offset,
page_count - old_page_count); */
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ if (flush_type == BUF_FLUSH_LRU) {
+ mutex_enter(&LRU_list_mutex);
+ }
goto flush_next;
} else if (flush_type == BUF_FLUSH_LRU) {
@@ -1081,16 +1141,28 @@ flush_next:
} else {
ut_ad(flush_type == BUF_FLUSH_LIST);
- bpage = UT_LIST_GET_PREV(list, bpage);
- ut_ad(!bpage || bpage->in_flush_list);
+ mutex_enter(&flush_list_mutex);
+ bpage = UT_LIST_GET_PREV(flush_list, bpage);
+ //ut_ad(!bpage || bpage->in_flush_list); /* optimistic */
+ mutex_exit(&flush_list_mutex);
+ remaining--;
}
} while (bpage != NULL);
+ if (remaining)
+ goto flush_next;
+
/* If we could not find anything to flush, leave the loop */
break;
}
+ if (flush_type == BUF_FLUSH_LRU) {
+ mutex_exit(&LRU_list_mutex);
+ }
+
+ mutex_enter(&buf_pool_mutex);
+
buf_pool->init_flush[flush_type] = FALSE;
if (buf_pool->n_flush[flush_type] == 0) {
@@ -1100,7 +1172,8 @@ flush_next:
os_event_set(buf_pool->no_flush[flush_type]);
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
buf_flush_buffered_writes();
@@ -1154,7 +1227,7 @@ buf_flush_LRU_recommendation(void)
//buf_pool_mutex_enter();
if (have_LRU_mutex)
- buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
n_replaceable = UT_LIST_GET_LEN(buf_pool->free);
@@ -1173,7 +1246,13 @@ buf_flush_LRU_recommendation(void)
mutex_t* block_mutex = buf_page_get_mutex(bpage);
+retry_lock:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
if (buf_flush_ready_for_replace(bpage)) {
n_replaceable++;
@@ -1188,7 +1267,7 @@ buf_flush_LRU_recommendation(void)
//buf_pool_mutex_exit();
if (have_LRU_mutex)
- buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
if (n_replaceable >= BUF_FLUSH_FREE_BLOCK_MARGIN) {
@@ -1238,17 +1317,17 @@ buf_flush_validate_low(void)
{
buf_page_t* bpage;
- UT_LIST_VALIDATE(list, buf_page_t, buf_pool->flush_list);
+ UT_LIST_VALIDATE(flush_list, buf_page_t, buf_pool->flush_list);
bpage = UT_LIST_GET_FIRST(buf_pool->flush_list);
while (bpage != NULL) {
const ib_uint64_t om = bpage->oldest_modification;
ut_ad(bpage->in_flush_list);
- ut_a(buf_page_in_file(bpage));
+ //ut_a(buf_page_in_file(bpage)); /* optimistic */
ut_a(om > 0);
- bpage = UT_LIST_GET_NEXT(list, bpage);
+ bpage = UT_LIST_GET_NEXT(flush_list, bpage);
ut_a(!bpage || om >= bpage->oldest_modification);
}
@@ -1266,11 +1345,13 @@ buf_flush_validate(void)
{
ibool ret;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&flush_list_mutex);
ret = buf_flush_validate_low();
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&flush_list_mutex);
return(ret);
}
=== modified file 'storage/xtradb/buf/buf0lru.c'
--- a/storage/xtradb/buf/buf0lru.c 2009-05-04 04:32:30 +0000
+++ b/storage/xtradb/buf/buf0lru.c 2009-06-25 01:43:25 +0000
@@ -129,25 +129,31 @@ static
void
buf_LRU_block_free_hashed_page(
/*===========================*/
- buf_block_t* block); /* in: block, must contain a file page and
+ buf_block_t* block, /* in: block, must contain a file page and
be in a state where it can be freed */
+ ibool have_page_hash_mutex);
/**********************************************************************
Determines if the unzip_LRU list should be used for evicting a victim
instead of the general LRU list. */
UNIV_INLINE
ibool
-buf_LRU_evict_from_unzip_LRU(void)
+buf_LRU_evict_from_unzip_LRU(
+ ibool have_LRU_mutex)
/*==============================*/
/* out: TRUE if should use unzip_LRU */
{
ulint io_avg;
ulint unzip_avg;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ if (!have_LRU_mutex)
+ mutex_enter(&LRU_list_mutex);
/* If the unzip_LRU list is empty, we can only use the LRU. */
if (UT_LIST_GET_LEN(buf_pool->unzip_LRU) == 0) {
+ if (!have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
return(FALSE);
}
@@ -156,14 +162,20 @@ buf_LRU_evict_from_unzip_LRU(void)
decompressed pages in the buffer pool. */
if (UT_LIST_GET_LEN(buf_pool->unzip_LRU)
<= UT_LIST_GET_LEN(buf_pool->LRU) / 10) {
+ if (!have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
return(FALSE);
}
/* If eviction hasn't started yet, we assume by default
that a workload is disk bound. */
if (buf_pool->freed_page_clock == 0) {
+ if (!have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
return(TRUE);
}
+ if (!have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
/* Calculate the average over past intervals, and add the values
of the current interval. */
@@ -229,7 +241,8 @@ buf_LRU_drop_page_hash_for_tablespace(
page_arr = ut_malloc(sizeof(ulint)
* BUF_LRU_DROP_SEARCH_HASH_SIZE);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
scan_again:
num_entries = 0;
@@ -239,7 +252,13 @@ scan_again:
mutex_t* block_mutex = buf_page_get_mutex(bpage);
buf_page_t* prev_bpage;
+retry_lock:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
prev_bpage = UT_LIST_GET_PREV(LRU, bpage);
ut_a(buf_page_in_file(bpage));
@@ -269,12 +288,14 @@ scan_again:
}
/* Array full. We release the buf_pool_mutex to
obey the latching order. */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
buf_LRU_drop_page_hash_batch(id, zip_size, page_arr,
num_entries);
num_entries = 0;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
} else {
mutex_exit(block_mutex);
}
@@ -299,7 +320,8 @@ next_page:
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
/* Drop any remaining batch of search hashed pages. */
buf_LRU_drop_page_hash_batch(id, zip_size, page_arr, num_entries);
@@ -327,7 +349,9 @@ buf_LRU_invalidate_tablespace(
buf_LRU_drop_page_hash_for_tablespace(id);
scan_again:
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
all_freed = TRUE;
@@ -339,7 +363,13 @@ scan_again:
ut_a(buf_page_in_file(bpage));
+retry_lock:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
prev_bpage = UT_LIST_GET_PREV(LRU, bpage);
if (buf_page_get_space(bpage) == id) {
@@ -369,7 +399,9 @@ scan_again:
ulint page_no;
ulint zip_size;
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
zip_size = buf_page_get_zip_size(bpage);
page_no = buf_page_get_page_no(bpage);
@@ -393,7 +425,7 @@ scan_again:
if (buf_LRU_block_remove_hashed_page(bpage, TRUE)
!= BUF_BLOCK_ZIP_FREE) {
buf_LRU_block_free_hashed_page((buf_block_t*)
- bpage);
+ bpage, TRUE);
} else {
/* The block_mutex should have been
released by buf_LRU_block_remove_hashed_page()
@@ -416,7 +448,9 @@ next_page:
bpage = prev_bpage;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
if (!all_freed) {
os_thread_sleep(20000);
@@ -439,14 +473,16 @@ buf_LRU_get_recent_limit(void)
ulint len;
ulint limit;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
len = UT_LIST_GET_LEN(buf_pool->LRU);
if (len < BUF_LRU_OLD_MIN_LEN) {
/* The LRU list is too short to do read-ahead */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
return(0);
}
@@ -455,7 +491,8 @@ buf_LRU_get_recent_limit(void)
limit = buf_page_get_LRU_position(bpage) - len / BUF_LRU_INITIAL_RATIO;
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
return(limit);
}
@@ -470,7 +507,9 @@ buf_LRU_insert_zip_clean(
{
buf_page_t* b;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
+ ut_ad(mutex_own(&flush_list_mutex));
ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_PAGE);
/* Find the first successor of bpage in the LRU list
@@ -478,17 +517,17 @@ buf_LRU_insert_zip_clean(
b = bpage;
do {
b = UT_LIST_GET_NEXT(LRU, b);
- } while (b && buf_page_get_state(b) != BUF_BLOCK_ZIP_PAGE);
+ } while (b && (buf_page_get_state(b) != BUF_BLOCK_ZIP_PAGE || !b->in_LRU_list));
/* Insert bpage before b, i.e., after the predecessor of b. */
if (b) {
- b = UT_LIST_GET_PREV(list, b);
+ b = UT_LIST_GET_PREV(zip_list, b);
}
if (b) {
- UT_LIST_INSERT_AFTER(list, buf_pool->zip_clean, b, bpage);
+ UT_LIST_INSERT_AFTER(zip_list, buf_pool->zip_clean, b, bpage);
} else {
- UT_LIST_ADD_FIRST(list, buf_pool->zip_clean, bpage);
+ UT_LIST_ADD_FIRST(zip_list, buf_pool->zip_clean, bpage);
}
}
@@ -500,16 +539,17 @@ ibool
buf_LRU_free_from_unzip_LRU_list(
/*=============================*/
/* out: TRUE if freed */
- ulint n_iterations) /* in: how many times this has been called
+ ulint n_iterations, /* in: how many times this has been called
repeatedly without result: a high value means
that we should search farther; we will search
n_iterations / 5 of the unzip_LRU list,
or nothing if n_iterations >= 5 */
+ ibool have_LRU_mutex)
{
buf_block_t* block;
ulint distance;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own()); /* optimistic */
/* Theoretically it should be much easier to find a victim
from unzip_LRU as we can choose even a dirty block (as we'll
@@ -519,7 +559,7 @@ buf_LRU_free_from_unzip_LRU_list(
if we have done five iterations so far. */
if (UNIV_UNLIKELY(n_iterations >= 5)
- || !buf_LRU_evict_from_unzip_LRU()) {
+ || !buf_LRU_evict_from_unzip_LRU(have_LRU_mutex)) {
return(FALSE);
}
@@ -527,18 +567,25 @@ buf_LRU_free_from_unzip_LRU_list(
distance = 100 + (n_iterations
* UT_LIST_GET_LEN(buf_pool->unzip_LRU)) / 5;
+restart:
for (block = UT_LIST_GET_LAST(buf_pool->unzip_LRU);
UNIV_LIKELY(block != NULL) && UNIV_LIKELY(distance > 0);
block = UT_LIST_GET_PREV(unzip_LRU, block), distance--) {
enum buf_lru_free_block_status freed;
+ mutex_enter(&block->mutex);
+ if (!block->in_unzip_LRU_list || !block->page.in_LRU_list
+ || buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE) {
+ mutex_exit(&block->mutex);
+ goto restart;
+ }
+
ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
ut_ad(block->in_unzip_LRU_list);
ut_ad(block->page.in_LRU_list);
- mutex_enter(&block->mutex);
- freed = buf_LRU_free_block(&block->page, FALSE, NULL);
+ freed = buf_LRU_free_block(&block->page, FALSE, NULL, have_LRU_mutex);
mutex_exit(&block->mutex);
switch (freed) {
@@ -571,20 +618,22 @@ ibool
buf_LRU_free_from_common_LRU_list(
/*==============================*/
/* out: TRUE if freed */
- ulint n_iterations) /* in: how many times this has been called
+ ulint n_iterations, /* in: how many times this has been called
repeatedly without result: a high value means
that we should search farther; if
n_iterations < 10, then we search
n_iterations / 10 * buf_pool->curr_size
pages from the end of the LRU list */
+ ibool have_LRU_mutex)
{
buf_page_t* bpage;
ulint distance;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own()); /* optimistic */
distance = 100 + (n_iterations * buf_pool->curr_size) / 10;
+restart:
for (bpage = UT_LIST_GET_LAST(buf_pool->LRU);
UNIV_LIKELY(bpage != NULL) && UNIV_LIKELY(distance > 0);
bpage = UT_LIST_GET_PREV(LRU, bpage), distance--) {
@@ -593,11 +642,25 @@ buf_LRU_free_from_common_LRU_list(
mutex_t* block_mutex
= buf_page_get_mutex(bpage);
+retry_lock:
+ mutex_enter(block_mutex);
+
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
+
+ if (!bpage->in_LRU_list
+ || !buf_page_in_file(bpage)) {
+ mutex_exit(block_mutex);
+ goto restart;
+ }
+
ut_ad(buf_page_in_file(bpage));
ut_ad(bpage->in_LRU_list);
- mutex_enter(block_mutex);
- freed = buf_LRU_free_block(bpage, TRUE, NULL);
+ freed = buf_LRU_free_block(bpage, TRUE, NULL, have_LRU_mutex);
mutex_exit(block_mutex);
switch (freed) {
@@ -640,22 +703,33 @@ buf_LRU_search_and_free_block(
n_iterations / 5 of the unzip_LRU list. */
{
ibool freed = FALSE;
+ ibool have_LRU_mutex = FALSE;
+
+ if (UT_LIST_GET_LEN(buf_pool->unzip_LRU))
+ have_LRU_mutex = TRUE;
- buf_pool_mutex_enter();
+ /* optimistic search... */
+ //buf_pool_mutex_enter();
+ if (have_LRU_mutex)
+ mutex_enter(&LRU_list_mutex);
- freed = buf_LRU_free_from_unzip_LRU_list(n_iterations);
+ freed = buf_LRU_free_from_unzip_LRU_list(n_iterations, have_LRU_mutex);
if (!freed) {
- freed = buf_LRU_free_from_common_LRU_list(n_iterations);
+ freed = buf_LRU_free_from_common_LRU_list(n_iterations, have_LRU_mutex);
}
+ mutex_enter(&buf_pool_mutex);
if (!freed) {
buf_pool->LRU_flush_ended = 0;
} else if (buf_pool->LRU_flush_ended > 0) {
buf_pool->LRU_flush_ended--;
}
+ mutex_exit(&buf_pool_mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ if (have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
return(freed);
}
@@ -673,18 +747,22 @@ void
buf_LRU_try_free_flushed_blocks(void)
/*=================================*/
{
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
while (buf_pool->LRU_flush_ended > 0) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
buf_LRU_search_and_free_block(1);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
}
/**********************************************************************
@@ -700,7 +778,9 @@ buf_LRU_buf_pool_running_out(void)
{
ibool ret = FALSE;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
+ mutex_enter(&free_list_mutex);
if (!recv_recovery_on && UT_LIST_GET_LEN(buf_pool->free)
+ UT_LIST_GET_LEN(buf_pool->LRU) < buf_pool->curr_size / 4) {
@@ -708,7 +788,9 @@ buf_LRU_buf_pool_running_out(void)
ret = TRUE;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ mutex_exit(&free_list_mutex);
return(ret);
}
@@ -725,9 +807,10 @@ buf_LRU_get_free_only(void)
{
buf_block_t* block;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
- block = (buf_block_t*) UT_LIST_GET_FIRST(buf_pool->free);
+ mutex_enter(&free_list_mutex);
+ block = (buf_block_t*) UT_LIST_GET_LAST(buf_pool->free);
if (block) {
ut_ad(block->page.in_free_list);
@@ -735,7 +818,9 @@ buf_LRU_get_free_only(void)
ut_ad(!block->page.in_flush_list);
ut_ad(!block->page.in_LRU_list);
ut_a(!buf_page_in_file(&block->page));
- UT_LIST_REMOVE(list, buf_pool->free, (&block->page));
+ UT_LIST_REMOVE(free, buf_pool->free, (&block->page));
+
+ mutex_exit(&free_list_mutex);
mutex_enter(&block->mutex);
@@ -743,6 +828,8 @@ buf_LRU_get_free_only(void)
UNIV_MEM_ALLOC(block->frame, UNIV_PAGE_SIZE);
mutex_exit(&block->mutex);
+ } else {
+ mutex_exit(&free_list_mutex);
}
return(block);
@@ -767,7 +854,7 @@ buf_LRU_get_free_block(
ibool mon_value_was = FALSE;
ibool started_monitor = FALSE;
loop:
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
if (!recv_recovery_on && UT_LIST_GET_LEN(buf_pool->free)
+ UT_LIST_GET_LEN(buf_pool->LRU) < buf_pool->curr_size / 20) {
@@ -847,14 +934,16 @@ loop:
if (UNIV_UNLIKELY(zip_size)) {
ibool lru;
page_zip_set_size(&block->page.zip, zip_size);
- block->page.zip.data = buf_buddy_alloc(zip_size, &lru);
+ mutex_enter(&LRU_list_mutex);
+ block->page.zip.data = buf_buddy_alloc(zip_size, &lru, FALSE);
+ mutex_exit(&LRU_list_mutex);
UNIV_MEM_DESC(block->page.zip.data, zip_size, block);
} else {
page_zip_set_size(&block->page.zip, 0);
block->page.zip.data = NULL;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
if (started_monitor) {
srv_print_innodb_monitor = mon_value_was;
@@ -866,7 +955,7 @@ loop:
/* If no block was in the free list, search from the end of the LRU
list and try to free a block there */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
freed = buf_LRU_search_and_free_block(n_iterations);
@@ -915,18 +1004,21 @@ loop:
os_aio_simulated_wake_handler_threads();
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
if (buf_pool->LRU_flush_ended > 0) {
/* We have written pages in an LRU flush. To make the insert
buffer more efficient, we try to move these pages to the free
list. */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
buf_LRU_try_free_flushed_blocks();
} else {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
}
if (n_iterations > 10) {
@@ -951,7 +1043,8 @@ buf_LRU_old_adjust_len(void)
ulint new_len;
ut_a(buf_pool->LRU_old);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
#if 3 * (BUF_LRU_OLD_MIN_LEN / 8) <= BUF_LRU_OLD_TOLERANCE + 5
# error "3 * (BUF_LRU_OLD_MIN_LEN / 8) <= BUF_LRU_OLD_TOLERANCE + 5"
#endif
@@ -1009,7 +1102,8 @@ buf_LRU_old_init(void)
{
buf_page_t* bpage;
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
ut_a(UT_LIST_GET_LEN(buf_pool->LRU) == BUF_LRU_OLD_MIN_LEN);
/* We first initialize all blocks in the LRU list as old and then use
@@ -1041,13 +1135,14 @@ buf_unzip_LRU_remove_block_if_needed(
ut_ad(buf_pool);
ut_ad(bpage);
ut_ad(buf_page_in_file(bpage));
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
if (buf_page_belongs_to_unzip_LRU(bpage)) {
buf_block_t* block = (buf_block_t*) bpage;
ut_ad(block->in_unzip_LRU_list);
- ut_d(block->in_unzip_LRU_list = FALSE);
+ block->in_unzip_LRU_list = FALSE;
UT_LIST_REMOVE(unzip_LRU, buf_pool->unzip_LRU, block);
}
@@ -1063,7 +1158,8 @@ buf_LRU_remove_block(
{
ut_ad(buf_pool);
ut_ad(bpage);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
ut_a(buf_page_in_file(bpage));
@@ -1126,12 +1222,13 @@ buf_unzip_LRU_add_block(
{
ut_ad(buf_pool);
ut_ad(block);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
ut_a(buf_page_belongs_to_unzip_LRU(&block->page));
ut_ad(!block->in_unzip_LRU_list);
- ut_d(block->in_unzip_LRU_list = TRUE);
+ block->in_unzip_LRU_list = TRUE;
if (old) {
UT_LIST_ADD_LAST(unzip_LRU, buf_pool->unzip_LRU, block);
@@ -1152,7 +1249,8 @@ buf_LRU_add_block_to_end_low(
ut_ad(buf_pool);
ut_ad(bpage);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
ut_a(buf_page_in_file(bpage));
@@ -1212,7 +1310,8 @@ buf_LRU_add_block_low(
{
ut_ad(buf_pool);
ut_ad(bpage);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
ut_a(buf_page_in_file(bpage));
ut_ad(!bpage->in_LRU_list);
@@ -1331,22 +1430,23 @@ buf_LRU_free_block(
buf_page_t* bpage, /* in: block to be freed */
ibool zip, /* in: TRUE if should remove also the
compressed page of an uncompressed page */
- ibool* buf_pool_mutex_released)
+ ibool* buf_pool_mutex_released,
/* in: pointer to a variable that will
be assigned TRUE if buf_pool_mutex
was temporarily released, or NULL */
+ ibool have_LRU_mutex)
{
buf_page_t* b = NULL;
mutex_t* block_mutex = buf_page_get_mutex(bpage);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mutex_own(block_mutex));
ut_ad(buf_page_in_file(bpage));
- ut_ad(bpage->in_LRU_list);
+ //ut_ad(bpage->in_LRU_list);
ut_ad(!bpage->in_flush_list == !bpage->oldest_modification);
UNIV_MEM_ASSERT_RW(bpage, sizeof *bpage);
- if (!buf_page_can_relocate(bpage)) {
+ if (!bpage->in_LRU_list || !block_mutex || !buf_page_can_relocate(bpage)) {
/* Do not free buffer-fixed or I/O-fixed blocks. */
return(BUF_LRU_NOT_FREED);
@@ -1378,15 +1478,15 @@ buf_LRU_free_block(
If it cannot be allocated (without freeing a block
from the LRU list), refuse to free bpage. */
alloc:
- buf_pool_mutex_exit_forbid();
- b = buf_buddy_alloc(sizeof *b, NULL);
- buf_pool_mutex_exit_allow();
+ //buf_pool_mutex_exit_forbid();
+ b = buf_buddy_alloc(sizeof *b, NULL, FALSE);
+ //buf_pool_mutex_exit_allow();
if (UNIV_UNLIKELY(!b)) {
return(BUF_LRU_CANNOT_RELOCATE);
}
- memcpy(b, bpage, sizeof *b);
+ //memcpy(b, bpage, sizeof *b);
}
#ifdef UNIV_DEBUG
@@ -1397,6 +1497,39 @@ alloc:
}
#endif /* UNIV_DEBUG */
+ /* To avoid breaking the latch order, we must release and re-enter block_mutex */
+ mutex_exit(block_mutex);
+
+ if (!have_LRU_mutex)
+ mutex_enter(&LRU_list_mutex); /* optimistic */
+ rw_lock_x_lock(&page_hash_latch);
+ mutex_enter(block_mutex);
+
+ /* recheck states of block */
+ if (!bpage->in_LRU_list || block_mutex != buf_page_get_mutex(bpage)
+ || !buf_page_can_relocate(bpage)) {
+not_freed:
+ if (b) {
+ buf_buddy_free(b, sizeof *b, TRUE);
+ }
+ if (!have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ return(BUF_LRU_NOT_FREED);
+ } else if (zip || !bpage->zip.data) {
+ if (bpage->oldest_modification)
+ goto not_freed;
+ } else if (bpage->oldest_modification) {
+ if (buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE) {
+ ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY);
+ goto not_freed;
+ }
+ }
+
+ if (b) {
+ memcpy(b, bpage, sizeof *b);
+ }
+
if (buf_LRU_block_remove_hashed_page(bpage, zip)
!= BUF_BLOCK_ZIP_FREE) {
ut_a(bpage->buf_fix_count == 0);
@@ -1408,6 +1541,10 @@ alloc:
ut_a(!buf_page_hash_get(bpage->space, bpage->offset));
+ while (prev_b && !prev_b->in_LRU_list) {
+ prev_b = UT_LIST_GET_PREV(LRU, prev_b);
+ }
+
b->state = b->oldest_modification
? BUF_BLOCK_ZIP_DIRTY
: BUF_BLOCK_ZIP_PAGE;
@@ -1482,6 +1619,7 @@ alloc:
buf_LRU_add_block_low(b, buf_page_is_old(b));
}
+ mutex_enter(&flush_list_mutex);
if (b->state == BUF_BLOCK_ZIP_PAGE) {
buf_LRU_insert_zip_clean(b);
} else {
@@ -1490,22 +1628,23 @@ alloc:
ut_ad(b->in_flush_list);
ut_d(bpage->in_flush_list = FALSE);
- prev = UT_LIST_GET_PREV(list, b);
- UT_LIST_REMOVE(list, buf_pool->flush_list, b);
+ prev = UT_LIST_GET_PREV(flush_list, b);
+ UT_LIST_REMOVE(flush_list, buf_pool->flush_list, b);
if (prev) {
ut_ad(prev->in_flush_list);
UT_LIST_INSERT_AFTER(
- list,
+ flush_list,
buf_pool->flush_list,
prev, b);
} else {
UT_LIST_ADD_FIRST(
- list,
+ flush_list,
buf_pool->flush_list,
b);
}
}
+ mutex_exit(&flush_list_mutex);
bpage->zip.data = NULL;
page_zip_set_size(&bpage->zip, 0);
@@ -1521,7 +1660,9 @@ alloc:
*buf_pool_mutex_released = TRUE;
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
mutex_exit(block_mutex);
/* Remove possible adaptive hash index on the page.
@@ -1553,7 +1694,9 @@ alloc:
: BUF_NO_CHECKSUM_MAGIC);
}
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ if (have_LRU_mutex)
+ mutex_enter(&LRU_list_mutex);
mutex_enter(block_mutex);
if (b) {
@@ -1563,13 +1706,17 @@ alloc:
mutex_exit(&buf_pool_zip_mutex);
}
- buf_LRU_block_free_hashed_page((buf_block_t*) bpage);
+ buf_LRU_block_free_hashed_page((buf_block_t*) bpage, FALSE);
} else {
/* The block_mutex should have been released by
buf_LRU_block_remove_hashed_page() when it returns
BUF_BLOCK_ZIP_FREE. */
ut_ad(block_mutex == &buf_pool_zip_mutex);
mutex_enter(block_mutex);
+
+ if (!have_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
}
return(BUF_LRU_FREED);
@@ -1581,12 +1728,13 @@ UNIV_INTERN
void
buf_LRU_block_free_non_file_page(
/*=============================*/
- buf_block_t* block) /* in: block, must not contain a file page */
+ buf_block_t* block, /* in: block, must not contain a file page */
+ ibool have_page_hash_mutex)
{
void* data;
ut_ad(block);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mutex_own(&block->mutex));
switch (buf_block_get_state(block)) {
@@ -1620,15 +1768,17 @@ buf_LRU_block_free_non_file_page(
if (data) {
block->page.zip.data = NULL;
mutex_exit(&block->mutex);
- buf_pool_mutex_exit_forbid();
- buf_buddy_free(data, page_zip_get_size(&block->page.zip));
- buf_pool_mutex_exit_allow();
+ //buf_pool_mutex_exit_forbid();
+ buf_buddy_free(data, page_zip_get_size(&block->page.zip), have_page_hash_mutex);
+ //buf_pool_mutex_exit_allow();
mutex_enter(&block->mutex);
page_zip_set_size(&block->page.zip, 0);
}
- UT_LIST_ADD_FIRST(list, buf_pool->free, (&block->page));
+ mutex_enter(&free_list_mutex);
+ UT_LIST_ADD_FIRST(free, buf_pool->free, (&block->page));
ut_d(block->page.in_free_list = TRUE);
+ mutex_exit(&free_list_mutex);
UNIV_MEM_ASSERT_AND_FREE(block->frame, UNIV_PAGE_SIZE);
}
@@ -1657,7 +1807,11 @@ buf_LRU_block_remove_hashed_page(
{
const buf_page_t* hashed_bpage;
ut_ad(bpage);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX));
+#endif
ut_ad(mutex_own(buf_page_get_mutex(bpage)));
ut_a(buf_page_get_io_fix(bpage) == BUF_IO_NONE);
@@ -1758,7 +1912,9 @@ buf_LRU_block_remove_hashed_page(
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
mutex_exit(buf_page_get_mutex(bpage));
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
buf_print();
buf_LRU_print();
buf_validate();
@@ -1781,14 +1937,14 @@ buf_LRU_block_remove_hashed_page(
ut_a(bpage->zip.data);
ut_a(buf_page_get_zip_size(bpage));
- UT_LIST_REMOVE(list, buf_pool->zip_clean, bpage);
+ UT_LIST_REMOVE(zip_list, buf_pool->zip_clean, bpage);
mutex_exit(&buf_pool_zip_mutex);
- buf_pool_mutex_exit_forbid();
+ //buf_pool_mutex_exit_forbid();
buf_buddy_free(bpage->zip.data,
- page_zip_get_size(&bpage->zip));
- buf_buddy_free(bpage, sizeof(*bpage));
- buf_pool_mutex_exit_allow();
+ page_zip_get_size(&bpage->zip), TRUE);
+ buf_buddy_free(bpage, sizeof(*bpage), TRUE);
+ //buf_pool_mutex_exit_allow();
UNIV_MEM_UNDESC(bpage);
return(BUF_BLOCK_ZIP_FREE);
@@ -1807,9 +1963,9 @@ buf_LRU_block_remove_hashed_page(
bpage->zip.data = NULL;
mutex_exit(&((buf_block_t*) bpage)->mutex);
- buf_pool_mutex_exit_forbid();
- buf_buddy_free(data, page_zip_get_size(&bpage->zip));
- buf_pool_mutex_exit_allow();
+ //buf_pool_mutex_exit_forbid();
+ buf_buddy_free(data, page_zip_get_size(&bpage->zip), TRUE);
+ //buf_pool_mutex_exit_allow();
mutex_enter(&((buf_block_t*) bpage)->mutex);
page_zip_set_size(&bpage->zip, 0);
}
@@ -1835,15 +1991,16 @@ static
void
buf_LRU_block_free_hashed_page(
/*===========================*/
- buf_block_t* block) /* in: block, must contain a file page and
+ buf_block_t* block, /* in: block, must contain a file page and
be in a state where it can be freed */
+ ibool have_page_hash_mutex)
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mutex_own(&block->mutex));
buf_block_set_state(block, BUF_BLOCK_MEMORY);
- buf_LRU_block_free_non_file_page(block);
+ buf_LRU_block_free_non_file_page(block, have_page_hash_mutex);
}
/************************************************************************
@@ -1861,7 +2018,8 @@ buf_LRU_stat_update(void)
goto func_exit;
}
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
/* Update the index. */
item = &buf_LRU_stat_arr[buf_LRU_stat_arr_ind];
@@ -1875,7 +2033,8 @@ buf_LRU_stat_update(void)
/* Put current entry in the array. */
memcpy(item, &buf_LRU_stat_cur, sizeof *item);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
func_exit:
/* Clear the current entry. */
@@ -1897,7 +2056,8 @@ buf_LRU_validate(void)
ulint LRU_pos;
ut_ad(buf_pool);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
if (UT_LIST_GET_LEN(buf_pool->LRU) >= BUF_LRU_OLD_MIN_LEN) {
@@ -1956,15 +2116,21 @@ buf_LRU_validate(void)
ut_a(buf_pool->LRU_old_len == old_len);
}
- UT_LIST_VALIDATE(list, buf_page_t, buf_pool->free);
+ mutex_exit(&LRU_list_mutex);
+ mutex_enter(&free_list_mutex);
+
+ UT_LIST_VALIDATE(free, buf_page_t, buf_pool->free);
for (bpage = UT_LIST_GET_FIRST(buf_pool->free);
bpage != NULL;
- bpage = UT_LIST_GET_NEXT(list, bpage)) {
+ bpage = UT_LIST_GET_NEXT(free, bpage)) {
ut_a(buf_page_get_state(bpage) == BUF_BLOCK_NOT_USED);
}
+ mutex_exit(&free_list_mutex);
+ mutex_enter(&LRU_list_mutex);
+
UT_LIST_VALIDATE(unzip_LRU, buf_block_t, buf_pool->unzip_LRU);
for (block = UT_LIST_GET_FIRST(buf_pool->unzip_LRU);
@@ -1976,7 +2142,8 @@ buf_LRU_validate(void)
ut_a(buf_page_belongs_to_unzip_LRU(&block->page));
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
return(TRUE);
}
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
@@ -1992,7 +2159,8 @@ buf_LRU_print(void)
const buf_page_t* bpage;
ut_ad(buf_pool);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&LRU_list_mutex);
fprintf(stderr, "Pool ulint clock %lu\n",
(ulong) buf_pool->ulint_clock);
@@ -2055,6 +2223,7 @@ buf_LRU_print(void)
bpage = UT_LIST_GET_NEXT(LRU, bpage);
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&LRU_list_mutex);
}
#endif /* UNIV_DEBUG_PRINT || UNIV_DEBUG || UNIV_BUF_DEBUG */
=== modified file 'storage/xtradb/buf/buf0rea.c'
--- a/storage/xtradb/buf/buf0rea.c 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/buf/buf0rea.c 2009-07-06 05:47:15 +0000
@@ -134,6 +134,46 @@ buf_read_page_low(
bpage = buf_page_init_for_read(err, mode, space, zip_size, unzip,
tablespace_version, offset);
if (bpage == NULL) {
+ /* bugfix: http://bugs.mysql.com/bug.php?id=43948 */
+ if (recv_recovery_is_on() && *err == DB_TABLESPACE_DELETED) {
+		/* hashed log records for this page must be handled here */
+ recv_addr_t* recv_addr;
+
+ mutex_enter(&(recv_sys->mutex));
+
+ if (recv_sys->apply_log_recs == FALSE) {
+ mutex_exit(&(recv_sys->mutex));
+ goto not_to_recover;
+ }
+
+ /* recv_get_fil_addr_struct() */
+ recv_addr = HASH_GET_FIRST(recv_sys->addr_hash,
+ hash_calc_hash(ut_fold_ulint_pair(space, offset),
+ recv_sys->addr_hash));
+ while (recv_addr) {
+ if ((recv_addr->space == space)
+ && (recv_addr->page_no == offset)) {
+ break;
+ }
+ recv_addr = HASH_GET_NEXT(addr_hash, recv_addr);
+ }
+
+ if ((recv_addr == NULL)
+ || (recv_addr->state == RECV_BEING_PROCESSED)
+ || (recv_addr->state == RECV_PROCESSED)) {
+ mutex_exit(&(recv_sys->mutex));
+ goto not_to_recover;
+ }
+
+ fprintf(stderr, " (cannot find space: %lu)", space);
+ recv_addr->state = RECV_PROCESSED;
+
+ ut_a(recv_sys->n_addrs);
+ recv_sys->n_addrs--;
+
+ mutex_exit(&(recv_sys->mutex));
+ }
+not_to_recover:
return(0);
}
@@ -246,18 +286,22 @@ buf_read_ahead_random(
LRU_recent_limit = buf_LRU_get_recent_limit();
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
if (buf_pool->n_pend_reads
> buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
return(0);
}
+ mutex_exit(&buf_pool_mutex);
/* Count how many blocks in the area have been recently accessed,
that is, reside near the start of the LRU list. */
+ rw_lock_s_lock(&page_hash_latch);
for (i = low; i < high; i++) {
const buf_page_t* bpage = buf_page_hash_get(space, i);
@@ -269,13 +313,15 @@ buf_read_ahead_random(
if (recent_blocks >= BUF_READ_AHEAD_RANDOM_THRESHOLD) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
goto read_ahead;
}
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
/* Do nothing */
return(0);
@@ -469,10 +515,12 @@ buf_read_ahead_linear(
tablespace_version = fil_space_get_version(space);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&buf_pool_mutex);
if (high > fil_space_get_size(space)) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
/* The area is not whole, return */
return(0);
@@ -480,10 +528,12 @@ buf_read_ahead_linear(
if (buf_pool->n_pend_reads
> buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&buf_pool_mutex);
return(0);
}
+ mutex_exit(&buf_pool_mutex);
/* Check that almost all pages in the area have been accessed; if
offset == low, the accesses must be in a descending order, otherwise,
@@ -497,6 +547,7 @@ buf_read_ahead_linear(
fail_count = 0;
+ rw_lock_s_lock(&page_hash_latch);
for (i = low; i < high; i++) {
bpage = buf_page_hash_get(space, i);
@@ -520,7 +571,8 @@ buf_read_ahead_linear(
* LINEAR_AREA_THRESHOLD_COEF) {
/* Too many failures: return */
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(0);
}
@@ -531,7 +583,8 @@ buf_read_ahead_linear(
bpage = buf_page_hash_get(space, offset);
if (bpage == NULL) {
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(0);
}
@@ -557,7 +610,8 @@ buf_read_ahead_linear(
pred_offset = fil_page_get_prev(frame);
succ_offset = fil_page_get_next(frame);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
if ((offset == low) && (succ_offset == offset + 1)) {
@@ -770,11 +824,11 @@ buf_read_recv_pages(
while (buf_pool->n_pend_reads >= recv_n_pool_free_frames / 2) {
os_aio_simulated_wake_handler_threads();
- os_thread_sleep(500000);
+ os_thread_sleep(10000);
count++;
- if (count > 100) {
+ if (count > 5000) {
fprintf(stderr,
"InnoDB: Error: InnoDB has waited for"
" 50 seconds for pending\n"
=== modified file 'storage/xtradb/dict/dict0boot.c'
--- a/storage/xtradb/dict/dict0boot.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/dict/dict0boot.c 2009-06-25 01:43:25 +0000
@@ -265,6 +265,7 @@ dict_boot(void)
system tables */
/*-------------------------*/
table = dict_mem_table_create("SYS_TABLES", DICT_HDR_SPACE, 8, 0);
+ table->n_mysql_handles_opened = 1; /* for pin */
dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0);
dict_mem_table_add_col(table, heap, "ID", DATA_BINARY, 0, 0);
@@ -314,6 +315,7 @@ dict_boot(void)
/*-------------------------*/
table = dict_mem_table_create("SYS_COLUMNS", DICT_HDR_SPACE, 7, 0);
+ table->n_mysql_handles_opened = 1; /* for pin */
dict_mem_table_add_col(table, heap, "TABLE_ID", DATA_BINARY, 0, 0);
dict_mem_table_add_col(table, heap, "POS", DATA_INT, 0, 4);
@@ -346,6 +348,7 @@ dict_boot(void)
/*-------------------------*/
table = dict_mem_table_create("SYS_INDEXES", DICT_HDR_SPACE, 7, 0);
+ table->n_mysql_handles_opened = 1; /* for pin */
dict_mem_table_add_col(table, heap, "TABLE_ID", DATA_BINARY, 0, 0);
dict_mem_table_add_col(table, heap, "ID", DATA_BINARY, 0, 0);
@@ -388,6 +391,7 @@ dict_boot(void)
/*-------------------------*/
table = dict_mem_table_create("SYS_FIELDS", DICT_HDR_SPACE, 3, 0);
+ table->n_mysql_handles_opened = 1; /* for pin */
dict_mem_table_add_col(table, heap, "INDEX_ID", DATA_BINARY, 0, 0);
dict_mem_table_add_col(table, heap, "POS", DATA_INT, 0, 4);
=== modified file 'storage/xtradb/dict/dict0crea.c'
--- a/storage/xtradb/dict/dict0crea.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/dict/dict0crea.c 2009-06-25 01:43:25 +0000
@@ -1184,6 +1184,9 @@ dict_create_or_check_foreign_constraint_
/* Foreign constraint system tables have already been
created, and they are ok */
+ table1->n_mysql_handles_opened = 1; /* for pin */
+ table2->n_mysql_handles_opened = 1; /* for pin */
+
mutex_exit(&(dict_sys->mutex));
return(DB_SUCCESS);
@@ -1265,6 +1268,11 @@ dict_create_or_check_foreign_constraint_
trx_commit_for_mysql(trx);
+ table1 = dict_table_get_low("SYS_FOREIGN");
+ table2 = dict_table_get_low("SYS_FOREIGN_COLS");
+ table1->n_mysql_handles_opened = 1; /* for pin */
+ table2->n_mysql_handles_opened = 1; /* for pin */
+
row_mysql_unlock_data_dictionary(trx);
trx_free_for_mysql(trx);
=== modified file 'storage/xtradb/dict/dict0dict.c'
--- a/storage/xtradb/dict/dict0dict.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/dict/dict0dict.c 2009-08-03 07:14:02 +0000
@@ -545,6 +545,8 @@ dict_table_get_on_id(
table = dict_table_get_on_id_low(table_id);
+ dict_table_LRU_trim(table);
+
mutex_exit(&(dict_sys->mutex));
return(table);
@@ -659,6 +661,8 @@ dict_table_get(
table->n_mysql_handles_opened++;
}
+ dict_table_LRU_trim(table);
+
mutex_exit(&(dict_sys->mutex));
if (table != NULL) {
@@ -1153,6 +1157,64 @@ dict_table_remove_from_cache(
dict_mem_table_free(table);
}
+/**************************************************************************
+Frees tables from the end of table_LRU if the dictionary cache occupies
+too much space. */
+UNIV_INTERN
+void
+dict_table_LRU_trim(
+/*================*/
+ dict_table_t* self)
+{
+ dict_table_t* table;
+ dict_table_t* prev_table;
+ dict_foreign_t* foreign;
+ ulint n_removed;
+ ulint n_have_parent;
+ ulint cached_foreign_tables;
+
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(mutex_own(&(dict_sys->mutex)));
+#endif /* UNIV_SYNC_DEBUG */
+
+retry:
+ n_removed = n_have_parent = 0;
+ table = UT_LIST_GET_LAST(dict_sys->table_LRU);
+
+ while ( srv_dict_size_limit && table
+ && ((dict_sys->table_hash->n_cells
+ + dict_sys->table_id_hash->n_cells) * sizeof(hash_cell_t)
+ + dict_sys->size) > srv_dict_size_limit ) {
+ prev_table = UT_LIST_GET_PREV(table_LRU, table);
+
+ if (table == self || table->n_mysql_handles_opened)
+ goto next_loop;
+
+ cached_foreign_tables = 0;
+ foreign = UT_LIST_GET_FIRST(table->foreign_list);
+ while (foreign != NULL) {
+ if (foreign->referenced_table)
+ cached_foreign_tables++;
+ foreign = UT_LIST_GET_NEXT(foreign_list, foreign);
+ }
+
+ if (cached_foreign_tables == 0) {
+ dict_table_remove_from_cache(table);
+ n_removed++;
+ } else {
+ n_have_parent++;
+ }
+next_loop:
+ table = prev_table;
+ }
+
+ if ( srv_dict_size_limit && n_have_parent && n_removed
+ && ((dict_sys->table_hash->n_cells
+ + dict_sys->table_id_hash->n_cells) * sizeof(hash_cell_t)
+ + dict_sys->size) > srv_dict_size_limit )
+ goto retry;
+}
+
/********************************************************************
If the given column name is reserved for InnoDB system columns, return
TRUE. */
@@ -2987,7 +3049,7 @@ scan_more:
} else if (quote) {
/* Within quotes: do not look for
starting quotes or comments. */
- } else if (*sptr == '"' || *sptr == '`') {
+ } else if (*sptr == '"' || *sptr == '`' || *sptr == '\'') {
/* Starting quote: remember the quote character. */
quote = *sptr;
} else if (*sptr == '#'
@@ -4276,7 +4338,8 @@ dict_table_print_low(
ut_ad(mutex_own(&(dict_sys->mutex)));
- dict_update_statistics_low(table, TRUE);
+ if (srv_stats_auto_update)
+ dict_update_statistics_low(table, TRUE);
fprintf(stderr,
"--------------------------------------\n"
=== modified file 'storage/xtradb/dict/dict0load.c'
--- a/storage/xtradb/dict/dict0load.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/dict/dict0load.c 2009-06-25 01:43:25 +0000
@@ -223,7 +223,7 @@ loop:
/* The table definition was corrupt if there
is no index */
- if (dict_table_get_first_index(table)) {
+ if (srv_stats_auto_update && dict_table_get_first_index(table)) {
dict_update_statistics_low(table, TRUE);
}
=== modified file 'storage/xtradb/fil/fil0fil.c'
--- a/storage/xtradb/fil/fil0fil.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/fil/fil0fil.c 2009-06-25 01:43:25 +0000
@@ -42,6 +42,10 @@ Created 10/25/1995 Heikki Tuuri
#include "mtr0log.h"
#include "dict0dict.h"
#include "page0zip.h"
+#include "trx0trx.h"
+#include "trx0sys.h"
+#include "pars0pars.h"
+#include "row0mysql.h"
/*
@@ -2977,7 +2981,7 @@ fil_open_single_table_tablespace(
ut_a(flags != DICT_TF_COMPACT);
file = os_file_create_simple_no_error_handling(
- filepath, OS_FILE_OPEN, OS_FILE_READ_ONLY, &success);
+ filepath, OS_FILE_OPEN, OS_FILE_READ_WRITE, &success);
if (!success) {
/* The following call prints an error message */
os_file_get_last_error(TRUE);
@@ -3025,6 +3029,275 @@ fil_open_single_table_tablespace(
space_id = fsp_header_get_space_id(page);
space_flags = fsp_header_get_flags(page);
+ if (srv_expand_import && (space_id != id || space_flags != flags)) {
+ dulint old_id[31];
+ dulint new_id[31];
+ ulint root_page[31];
+ ulint n_index;
+ os_file_t info_file = -1;
+ char* info_file_path;
+ ulint i;
+ int len;
+ ib_uint64_t current_lsn;
+
+ current_lsn = log_get_lsn();
+
+ /* overwrite fsp header */
+ fsp_header_init_fields(page, id, flags);
+ mach_write_to_4(page + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, id);
+ space_id = id;
+ space_flags = flags;
+ if (mach_read_ull(page + FIL_PAGE_FILE_FLUSH_LSN) > current_lsn)
+ mach_write_ull(page + FIL_PAGE_FILE_FLUSH_LSN, current_lsn);
+ mach_write_to_4(page + FIL_PAGE_SPACE_OR_CHKSUM,
+ srv_use_checksums
+ ? buf_calc_page_new_checksum(page)
+ : BUF_NO_CHECKSUM_MAGIC);
+ mach_write_to_4(page + UNIV_PAGE_SIZE - FIL_PAGE_END_LSN_OLD_CHKSUM,
+ srv_use_checksums
+ ? buf_calc_page_old_checksum(page)
+ : BUF_NO_CHECKSUM_MAGIC);
+ success = os_file_write(filepath, file, page, 0, 0, UNIV_PAGE_SIZE);
+
+ /* get file size */
+ ulint size_low, size_high, size;
+ ib_int64_t size_bytes;
+ os_file_get_size(file, &size_low, &size_high);
+ size_bytes = (((ib_int64_t)size_high) << 32)
+ + (ib_int64_t)size_low;
+
+		/* get cluster index information */
+ dict_table_t* table;
+ dict_index_t* index;
+ table = dict_table_get_low(name);
+ index = dict_table_get_first_index(table);
+ ut_a(index->page==3);
+
+
+ /* read metadata from .exp file */
+ n_index = 0;
+ bzero(old_id, sizeof(old_id));
+ bzero(new_id, sizeof(new_id));
+ bzero(root_page, sizeof(root_page));
+
+ info_file_path = fil_make_ibd_name(name, FALSE);
+ len = strlen(info_file_path);
+ info_file_path[len - 3] = 'e';
+ info_file_path[len - 2] = 'x';
+ info_file_path[len - 1] = 'p';
+
+ info_file = os_file_create_simple_no_error_handling(
+ info_file_path, OS_FILE_OPEN, OS_FILE_READ_ONLY, &success);
+ if (!success) {
+ fprintf(stderr, "InnoDB: cannot open %s\n", info_file_path);
+ goto skip_info;
+ }
+ success = os_file_read(info_file, page, 0, 0, UNIV_PAGE_SIZE);
+ if (!success) {
+ fprintf(stderr, "InnoDB: cannot read %s\n", info_file_path);
+ goto skip_info;
+ }
+ if (mach_read_from_4(page) != 0x78706f72UL
+ || mach_read_from_4(page + 4) != 0x74696e66UL) {
+ fprintf(stderr, "InnoDB: %s seems not to be a correct .exp file\n", info_file_path);
+ goto skip_info;
+ }
+
+ fprintf(stderr, "InnoDB: import: extended import of %s is started.\n", name);
+
+ n_index = mach_read_from_4(page + 8);
+ fprintf(stderr, "InnoDB: import: %lu indexes are detected.\n", (ulong)n_index);
+ for (i = 0; i < n_index; i++) {
+ new_id[i] =
+ dict_table_get_index_on_name(table,
+ (page + (i + 1) * 512 + 12))->id;
+ old_id[i] = mach_read_from_8(page + (i + 1) * 512);
+ root_page[i] = mach_read_from_4(page + (i + 1) * 512 + 8);
+ }
+
+skip_info:
+ if (info_file != -1)
+ os_file_close(info_file);
+
+ /*
+ if (size_bytes >= 1024 * 1024) {
+ size_bytes = ut_2pow_round(size_bytes, 1024 * 1024);
+ }
+ */
+ if (!(flags & DICT_TF_ZSSIZE_MASK)) {
+ mem_heap_t* heap = NULL;
+ ulint offsets_[REC_OFFS_NORMAL_SIZE];
+ ulint* offsets = offsets_;
+ size = (ulint) (size_bytes / UNIV_PAGE_SIZE);
+			/* overwrite the space id of all pages */
+ ib_int64_t offset;
+
+ rec_offs_init(offsets_);
+
+ fprintf(stderr, "InnoDB: Progress in %:");
+
+ for (offset = 0; offset < size_bytes; offset += UNIV_PAGE_SIZE) {
+ success = os_file_read(file, page,
+ (ulint)(offset & 0xFFFFFFFFUL),
+ (ulint)(offset >> 32), UNIV_PAGE_SIZE);
+ if (mach_read_from_4(page + FIL_PAGE_OFFSET) || !offset) {
+ mach_write_to_4(page + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, id);
+
+ for (i = 0; i < n_index; i++) {
+ if (offset / UNIV_PAGE_SIZE == root_page[i]) {
+ /* this is index root page */
+ mach_write_to_4(page + FIL_PAGE_DATA + PAGE_BTR_SEG_LEAF
+ + FSEG_HDR_SPACE, id);
+ mach_write_to_4(page + FIL_PAGE_DATA + PAGE_BTR_SEG_TOP
+ + FSEG_HDR_SPACE, id);
+ break;
+ }
+ }
+
+ if (fil_page_get_type(page) == FIL_PAGE_INDEX) {
+ dulint tmp = mach_read_from_8(page + (PAGE_HEADER + PAGE_INDEX_ID));
+
+ if (mach_read_from_2(page + PAGE_HEADER + PAGE_LEVEL) == 0
+ && ut_dulint_cmp(old_id[0], tmp) == 0) {
+ /* leaf page of cluster index, reset trx_id of records */
+ rec_t* rec;
+ rec_t* supremum;
+ ulint n_recs;
+
+ supremum = page_get_supremum_rec(page);
+ rec = page_rec_get_next(page_get_infimum_rec(page));
+ n_recs = page_get_n_recs(page);
+
+ while (rec && rec != supremum && n_recs > 0) {
+ ulint offset = index->trx_id_offset;
+ if (!offset) {
+ offsets = rec_get_offsets(rec, index, offsets,
+ ULINT_UNDEFINED, &heap);
+ offset = row_get_trx_id_offset(rec, index, offsets);
+ }
+ trx_write_trx_id(rec + offset, ut_dulint_create(0, 1));
+ rec = page_rec_get_next(rec);
+ n_recs--;
+ }
+ }
+
+ for (i = 0; i < n_index; i++) {
+ if (ut_dulint_cmp(old_id[i], tmp) == 0) {
+ mach_write_to_8(page + (PAGE_HEADER + PAGE_INDEX_ID), new_id[i]);
+ break;
+ }
+ }
+ }
+
+ if (mach_read_ull(page + FIL_PAGE_LSN) > current_lsn) {
+ mach_write_ull(page + FIL_PAGE_LSN, current_lsn);
+ mach_write_ull(page + UNIV_PAGE_SIZE - FIL_PAGE_END_LSN_OLD_CHKSUM,
+ current_lsn);
+ }
+
+ mach_write_to_4(page + FIL_PAGE_SPACE_OR_CHKSUM,
+ srv_use_checksums
+ ? buf_calc_page_new_checksum(page)
+ : BUF_NO_CHECKSUM_MAGIC);
+ mach_write_to_4(page + UNIV_PAGE_SIZE - FIL_PAGE_END_LSN_OLD_CHKSUM,
+ srv_use_checksums
+ ? buf_calc_page_old_checksum(page)
+ : BUF_NO_CHECKSUM_MAGIC);
+
+ success = os_file_write(filepath, file, page,
+ (ulint)(offset & 0xFFFFFFFFUL),
+ (ulint)(offset >> 32), UNIV_PAGE_SIZE);
+ }
+
+ if (size_bytes
+ && ((ib_int64_t)((offset + UNIV_PAGE_SIZE) * 100) / size_bytes)
+ != ((offset * 100) / size_bytes)) {
+ fprintf(stderr, " %lu",
+ (ulong)((ib_int64_t)((offset + UNIV_PAGE_SIZE) * 100) / size_bytes));
+ }
+ }
+
+ fprintf(stderr, " done.\n");
+
+ /* update SYS_INDEXES set root page */
+ index = dict_table_get_first_index(table);
+ while (index) {
+ for (i = 0; i < n_index; i++) {
+ if (ut_dulint_cmp(new_id[i], index->id) == 0) {
+ break;
+ }
+ }
+
+ if (i != n_index
+ && root_page[i] != index->page) {
+ /* must update */
+ ulint error;
+ trx_t* trx;
+ pars_info_t* info = NULL;
+
+ trx = trx_allocate_for_mysql();
+ trx->op_info = "extended import";
+
+ info = pars_info_create();
+
+ pars_info_add_dulint_literal(info, "indexid", new_id[i]);
+ pars_info_add_int4_literal(info, "new_page", (lint) root_page[i]);
+
+ error = que_eval_sql(info,
+ "PROCEDURE UPDATE_INDEX_PAGE () IS\n"
+ "BEGIN\n"
+ "UPDATE SYS_INDEXES"
+ " SET PAGE_NO = :new_page"
+ " WHERE ID = :indexid;\n"
+ "COMMIT WORK;\n"
+ "END;\n",
+ FALSE, trx);
+
+ if (error != DB_SUCCESS) {
+ fprintf(stderr, "InnoDB: failed to update SYS_INDEXES\n");
+ }
+
+ trx_commit_for_mysql(trx);
+
+ trx_free_for_mysql(trx);
+
+ index->page = root_page[i];
+ }
+
+ index = dict_table_get_next_index(index);
+ }
+ if (UNIV_LIKELY_NULL(heap)) {
+ mem_heap_free(heap);
+ }
+ } else {
+ /* zip page? */
+ size = (ulint)
+ (size_bytes
+ / dict_table_flags_to_zip_size(flags));
+ fprintf(stderr, "InnoDB: import: table %s seems to be in newer format."
+ " It may not be able to treated for now.\n", name);
+ }
+ /* .exp file should be removed */
+ success = os_file_delete(info_file_path);
+ if (!success) {
+ success = os_file_delete_if_exists(info_file_path);
+ }
+ mem_free(info_file_path);
+
+ fil_system_t* system = fil_system;
+ mutex_enter(&(system->mutex));
+ fil_node_t* node = NULL;
+ fil_space_t* space;
+ space = fil_space_get_by_id(id);
+ if (space)
+ node = UT_LIST_GET_FIRST(space->chain);
+ if (node && node->size < size) {
+ space->size += (size - node->size);
+ node->size = size;
+ }
+ mutex_exit(&(system->mutex));
+ }
+
ut_free(buf2);
if (UNIV_UNLIKELY(space_id != id || space_flags != flags)) {
=== modified file 'storage/xtradb/handler/ha_innodb.cc'
--- a/storage/xtradb/handler/ha_innodb.cc 2009-06-18 12:39:21 +0000
+++ b/storage/xtradb/handler/ha_innodb.cc 2009-08-03 20:09:53 +0000
@@ -157,6 +157,7 @@ static long innobase_mirrored_log_groups
innobase_autoinc_lock_mode;
static unsigned long innobase_read_io_threads, innobase_write_io_threads;
+static my_bool innobase_thread_concurrency_timer_based;
static long long innobase_buffer_pool_size, innobase_log_file_size;
/* The default values for the following char* start-up parameters
@@ -488,6 +489,8 @@ static SHOW_VAR innodb_status_variables[
(char*) &export_vars.innodb_dblwr_pages_written, SHOW_LONG},
{"dblwr_writes",
(char*) &export_vars.innodb_dblwr_writes, SHOW_LONG},
+ {"dict_tables",
+ (char*) &export_vars.innodb_dict_tables, SHOW_LONG},
{"have_atomic_builtins",
(char*) &export_vars.innodb_have_atomic_builtins, SHOW_BOOL},
{"log_waits",
@@ -2100,77 +2103,6 @@ mem_free_and_error:
goto error;
}
-#ifdef HAVE_REPLICATION
-#ifdef MYSQL_SERVER
- if(innobase_overwrite_relay_log_info) {
- /* If InnoDB progressed from relay-log.info, overwrite it */
- if (fname[0] == '\0') {
- fprintf(stderr,
- "InnoDB: something wrong with relay-info.log. InnoDB will not overwrite it.\n");
- } else if (0 != strcmp(fname, trx_sys_mysql_master_log_name)
- || pos != trx_sys_mysql_master_log_pos) {
- /* Overwrite relay-log.info */
- bzero((char*) &info_file, sizeof(info_file));
- fn_format(fname, relay_log_info_file, mysql_data_home, "", 4+32);
-
- int error = 0;
-
- if (!access(fname,F_OK)) {
- /* exist */
- if ((info_fd = my_open(fname, O_RDWR|O_BINARY, MYF(MY_WME))) < 0) {
- error = 1;
- } else if (init_io_cache(&info_file, info_fd, IO_SIZE*2,
- WRITE_CACHE, 0L, 0, MYF(MY_WME))) {
- error = 1;
- }
-
- if (error) {
- if (info_fd >= 0)
- my_close(info_fd, MYF(0));
- goto skip_overwrite;
- }
- } else {
- error = 1;
- goto skip_overwrite;
- }
-
- char buff[FN_REFLEN*2+22*2+4], *pos;
-
- my_b_seek(&info_file, 0L);
- pos=strmov(buff, trx_sys_mysql_relay_log_name);
- *pos++='\n';
- pos=longlong2str(trx_sys_mysql_relay_log_pos, pos, 10);
- *pos++='\n';
- pos=strmov(pos, trx_sys_mysql_master_log_name);
- *pos++='\n';
- pos=longlong2str(trx_sys_mysql_master_log_pos, pos, 10);
- *pos='\n';
-
- if (my_b_write(&info_file, (uchar*) buff, (size_t) (pos-buff)+1))
- error = 1;
- if (flush_io_cache(&info_file))
- error = 1;
-
- end_io_cache(&info_file);
- if (info_fd >= 0)
- my_close(info_fd, MYF(0));
-skip_overwrite:
- if (error) {
- fprintf(stderr,
- "InnoDB: ERROR: error occured during overwriting relay-log.info.\n");
- } else {
- fprintf(stderr,
- "InnoDB: relay-log.info was overwritten.\n");
- }
- } else {
- fprintf(stderr,
- "InnoDB: InnoDB and relay-log.info are synchronized. InnoDB will not overwrite it.\n");
- }
- }
-#endif /* MYSQL_SERVER */
-#endif /* HAVE_REPLICATION */
-
-
srv_extra_undoslots = (ibool) innobase_extra_undoslots;
/* -------------- Log files ---------------------------*/
@@ -2266,6 +2198,9 @@ skip_overwrite:
srv_n_log_files = (ulint) innobase_log_files_in_group;
srv_log_file_size = (ulint) innobase_log_file_size;
+ srv_thread_concurrency_timer_based =
+ (ibool) innobase_thread_concurrency_timer_based;
+
#ifdef UNIV_LOG_ARCHIVE
srv_log_archive_on = (ulint) innobase_log_archive;
#endif /* UNIV_LOG_ARCHIVE */
@@ -2280,6 +2215,7 @@ skip_overwrite:
srv_n_write_io_threads = (ulint) innobase_write_io_threads;
srv_read_ahead &= 3;
+ srv_adaptive_checkpoint %= 3;
srv_force_recovery = (ulint) innobase_force_recovery;
@@ -2329,6 +2265,76 @@ skip_overwrite:
goto mem_free_and_error;
}
+#ifdef HAVE_REPLICATION
+#ifdef MYSQL_SERVER
+ if(innobase_overwrite_relay_log_info) {
+ /* If InnoDB progressed from relay-log.info, overwrite it */
+ if (fname[0] == '\0') {
+ fprintf(stderr,
+ "InnoDB: something wrong with relay-info.log. InnoDB will not overwrite it.\n");
+ } else if (0 != strcmp(fname, trx_sys_mysql_master_log_name)
+ || pos != trx_sys_mysql_master_log_pos) {
+ /* Overwrite relay-log.info */
+ bzero((char*) &info_file, sizeof(info_file));
+ fn_format(fname, relay_log_info_file, mysql_data_home, "", 4+32);
+
+ int error = 0;
+
+ if (!access(fname,F_OK)) {
+ /* exist */
+ if ((info_fd = my_open(fname, O_RDWR|O_BINARY, MYF(MY_WME))) < 0) {
+ error = 1;
+ } else if (init_io_cache(&info_file, info_fd, IO_SIZE*2,
+ WRITE_CACHE, 0L, 0, MYF(MY_WME))) {
+ error = 1;
+ }
+
+ if (error) {
+ if (info_fd >= 0)
+ my_close(info_fd, MYF(0));
+ goto skip_overwrite;
+ }
+ } else {
+ error = 1;
+ goto skip_overwrite;
+ }
+
+ char buff[FN_REFLEN*2+22*2+4], *pos;
+
+ my_b_seek(&info_file, 0L);
+ pos=strmov(buff, trx_sys_mysql_relay_log_name);
+ *pos++='\n';
+ pos=longlong2str(trx_sys_mysql_relay_log_pos, pos, 10);
+ *pos++='\n';
+ pos=strmov(pos, trx_sys_mysql_master_log_name);
+ *pos++='\n';
+ pos=longlong2str(trx_sys_mysql_master_log_pos, pos, 10);
+ *pos='\n';
+
+ if (my_b_write(&info_file, (uchar*) buff, (size_t) (pos-buff)+1))
+ error = 1;
+ if (flush_io_cache(&info_file))
+ error = 1;
+
+ end_io_cache(&info_file);
+ if (info_fd >= 0)
+ my_close(info_fd, MYF(0));
+skip_overwrite:
+ if (error) {
+ fprintf(stderr,
+ "InnoDB: ERROR: error occured during overwriting relay-log.info.\n");
+ } else {
+ fprintf(stderr,
+ "InnoDB: relay-log.info was overwritten.\n");
+ }
+ } else {
+ fprintf(stderr,
+ "InnoDB: InnoDB and relay-log.info are synchronized. InnoDB will not overwrite it.\n");
+ }
+ }
+#endif /* MYSQL_SERVER */
+#endif /* HAVE_REPLICATION */
+
innobase_open_tables = hash_create(200);
pthread_mutex_init(&innobase_share_mutex, MY_MUTEX_INIT_FAST);
pthread_mutex_init(&prepare_commit_mutex, MY_MUTEX_INIT_FAST);
@@ -7079,7 +7085,9 @@ ha_innobase::info(
ib_table = prebuilt->table;
if (flag & HA_STATUS_TIME) {
- if (innobase_stats_on_metadata) {
+ if (innobase_stats_on_metadata
+ && (thd_sql_command(user_thd) == SQLCOM_ANALYZE
+ || srv_stats_auto_update)) {
/* In sql_show we call with this flag: update
then statistics so that they are up-to-date */
@@ -9321,7 +9329,8 @@ ha_innobase::check_if_incompatible_data(
if (info_row_type == ROW_TYPE_DEFAULT)
info_row_type = ROW_TYPE_COMPACT;
if ((info->used_fields & HA_CREATE_USED_ROW_FORMAT) &&
- row_type != info_row_type) {
+ get_row_type() != ((info->row_type == ROW_TYPE_DEFAULT)
+ ? ROW_TYPE_COMPACT : info->row_type)) {
DBUG_PRINT("info", ("get_row_type()=%d != info->row_type=%d -> "
"COMPATIBLE_DATA_NO",
@@ -9830,6 +9839,31 @@ static MYSQL_SYSVAR_ULONGLONG(stats_samp
"The number of index pages to sample when calculating statistics (default 8)",
NULL, NULL, 8, 1, ~0ULL, 0);
+const char *innobase_stats_method_names[]=
+{
+ "nulls_equal",
+ "nulls_unequal",
+ "nulls_ignored",
+ NullS
+};
+TYPELIB innobase_stats_method_typelib=
+{
+ array_elements(innobase_stats_method_names) - 1, "innobase_stats_method_typelib",
+ innobase_stats_method_names, NULL
+};
+static MYSQL_SYSVAR_ENUM(stats_method, srv_stats_method,
+ PLUGIN_VAR_RQCMDARG,
+ "Specifies how the InnoDB index statistics collection code should treat NULLs. "
+ "Possible values are the same as for 'myisam_stats_method'. "
+ "This is a startup parameter.",
+ NULL, NULL, 0, &innobase_stats_method_typelib);
+
+static MYSQL_SYSVAR_ULONG(stats_auto_update, srv_stats_auto_update,
+ PLUGIN_VAR_RQCMDARG,
+ "Enable/Disable InnoDB's automatic statistics updates for indexes "
+ "(except for the ANALYZE TABLE command). 0:disable 1:enable",
+ NULL, NULL, 1, 0, 1, 0);
+
static MYSQL_SYSVAR_BOOL(adaptive_hash_index, btr_search_enabled,
PLUGIN_VAR_OPCMDARG,
"Enable InnoDB adaptive hash index (enabled by default). "
@@ -9907,6 +9941,12 @@ static MYSQL_SYSVAR_ULONG(sync_spin_loop
"Count of spin-loop rounds in InnoDB mutexes",
NULL, NULL, 20L, 0L, ~0L, 0);
+static MYSQL_SYSVAR_BOOL(thread_concurrency_timer_based,
+ innobase_thread_concurrency_timer_based,
+ PLUGIN_VAR_NOCMDARG | PLUGIN_VAR_READONLY,
+ "Use InnoDB timer-based concurrency throttling.",
+ NULL, NULL, FALSE);
+
static MYSQL_SYSVAR_ULONG(thread_concurrency, srv_thread_concurrency,
PLUGIN_VAR_RQCMDARG,
"Helps in performance tuning in heavily concurrent environments. Sets the maximum number of threads allowed inside InnoDB. Value 0 will disable the thread throttling.",
@@ -9953,7 +9993,7 @@ static MYSQL_SYSVAR_STR(change_buffering
static MYSQL_SYSVAR_ULONG(io_capacity, srv_io_capacity,
PLUGIN_VAR_RQCMDARG,
"Number of IO operations per second the server can do. Tunes background IO rate.",
- NULL, NULL, 100, 100, 999999999, 0);
+ NULL, NULL, 200, 100, 999999999, 0);
static MYSQL_SYSVAR_LONGLONG(ibuf_max_size, srv_ibuf_max_size,
PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
@@ -10008,10 +10048,36 @@ static MYSQL_SYSVAR_ENUM(read_ahead, srv
"Control read ahead activity. (none, random, linear, [both])",
NULL, innodb_read_ahead_update, 3, &read_ahead_typelib);
-static MYSQL_SYSVAR_ULONG(adaptive_checkpoint, srv_adaptive_checkpoint,
+static
+void
+innodb_adaptive_checkpoint_update(
+ THD* thd,
+ struct st_mysql_sys_var* var,
+ void* var_ptr,
+ const void* save)
+{
+ *(long *)var_ptr= (*(long *)save) % 3;
+}
+const char *adaptive_checkpoint_names[]=
+{
+ "none", /* 0 */
+ "reflex", /* 1 */
+ "estimate", /* 2 */
+ /* For compatibility with the older patch */
+ "0", /* 3 ("none" + 3) */
+ "1", /* 4 ("reflex" + 3) */
+ "2", /* 5 ("estimate" + 3) */
+ NullS
+};
+TYPELIB adaptive_checkpoint_typelib=
+{
+ array_elements(adaptive_checkpoint_names) - 1, "adaptive_checkpoint_typelib",
+ adaptive_checkpoint_names, NULL
+};
+static MYSQL_SYSVAR_ENUM(adaptive_checkpoint, srv_adaptive_checkpoint,
PLUGIN_VAR_RQCMDARG,
- "Enable/Disable flushing along modified age. 0:disable 1:enable",
- NULL, NULL, 0, 0, 1, 0);
+ "Enable/Disable adaptive flushing based on the age of modified pages. ([none], reflex, estimate)",
+ NULL, innodb_adaptive_checkpoint_update, 0, &adaptive_checkpoint_typelib);
static MYSQL_SYSVAR_ULONG(enable_unsafe_group_commit, srv_enable_unsafe_group_commit,
PLUGIN_VAR_RQCMDARG,
@@ -10021,18 +10087,28 @@ static MYSQL_SYSVAR_ULONG(enable_unsafe_
static MYSQL_SYSVAR_ULONG(read_io_threads, innobase_read_io_threads,
PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
"Number of background read I/O threads in InnoDB.",
- NULL, NULL, 1, 1, 64, 0);
+ NULL, NULL, 8, 1, 64, 0);
static MYSQL_SYSVAR_ULONG(write_io_threads, innobase_write_io_threads,
PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
"Number of background write I/O threads in InnoDB.",
- NULL, NULL, 1, 1, 64, 0);
+ NULL, NULL, 8, 1, 64, 0);
+
+static MYSQL_SYSVAR_ULONG(expand_import, srv_expand_import,
+ PLUGIN_VAR_RQCMDARG,
+ "Enable/Disable automatic conversion of *.ibd files during IMPORT TABLESPACE.",
+ NULL, NULL, 0, 0, 1, 0);
static MYSQL_SYSVAR_ULONG(extra_rsegments, srv_extra_rsegments,
PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
"Number of extra user rollback segments when create new database.",
NULL, NULL, 0, 0, 127, 0);
+static MYSQL_SYSVAR_ULONG(dict_size_limit, srv_dict_size_limit,
+ PLUGIN_VAR_RQCMDARG,
+ "Limit the memory allocated for the dictionary cache. (0: unlimited)",
+ NULL, NULL, 0, 0, LONG_MAX, 0);
+
static struct st_mysql_sys_var* innobase_system_variables[]= {
MYSQL_SYSVAR(additional_mem_pool_size),
MYSQL_SYSVAR(autoextend_increment),
@@ -10069,6 +10145,8 @@ static struct st_mysql_sys_var* innobase
MYSQL_SYSVAR(overwrite_relay_log_info),
MYSQL_SYSVAR(rollback_on_timeout),
MYSQL_SYSVAR(stats_on_metadata),
+ MYSQL_SYSVAR(stats_method),
+ MYSQL_SYSVAR(stats_auto_update),
MYSQL_SYSVAR(stats_sample_pages),
MYSQL_SYSVAR(adaptive_hash_index),
MYSQL_SYSVAR(replication_delay),
@@ -10078,6 +10156,7 @@ static struct st_mysql_sys_var* innobase
MYSQL_SYSVAR(sync_spin_loops),
MYSQL_SYSVAR(table_locks),
MYSQL_SYSVAR(thread_concurrency),
+ MYSQL_SYSVAR(thread_concurrency_timer_based),
MYSQL_SYSVAR(thread_sleep_delay),
MYSQL_SYSVAR(autoinc_lock_mode),
MYSQL_SYSVAR(show_verbose_locks),
@@ -10093,7 +10172,9 @@ static struct st_mysql_sys_var* innobase
MYSQL_SYSVAR(enable_unsafe_group_commit),
MYSQL_SYSVAR(read_io_threads),
MYSQL_SYSVAR(write_io_threads),
+ MYSQL_SYSVAR(expand_import),
MYSQL_SYSVAR(extra_rsegments),
+ MYSQL_SYSVAR(dict_size_limit),
MYSQL_SYSVAR(use_sys_malloc),
MYSQL_SYSVAR(change_buffering),
NULL
@@ -10287,6 +10368,8 @@ i_s_innodb_cmp,
i_s_innodb_cmp_reset,
i_s_innodb_cmpmem,
i_s_innodb_cmpmem_reset,
+i_s_innodb_table_stats,
+i_s_innodb_index_stats,
i_s_innodb_patches
mysql_declare_plugin_end;
=== modified file 'storage/xtradb/handler/i_s.cc'
--- a/storage/xtradb/handler/i_s.cc 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/handler/i_s.cc 2009-06-25 01:43:25 +0000
@@ -45,6 +45,7 @@ extern "C" {
#include "dict0dict.h" /* for dict_index_get_if_in_cache */
#include "trx0rseg.h" /* for trx_rseg_struct */
#include "trx0sys.h" /* for trx_sys */
+#include "dict0dict.h" /* for dict_sys */
/* from buf0buf.c */
struct buf_chunk_struct{
ulint mem_size; /* allocated size of the chunk */
@@ -2282,7 +2283,8 @@ i_s_cmpmem_fill_low(
RETURN_IF_INNODB_NOT_STARTED(tables->schema_table_name);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ mutex_enter(&zip_free_mutex);
for (uint x = 0; x <= BUF_BUDDY_SIZES; x++) {
buf_buddy_stat_t* buddy_stat = &buf_buddy_stat[x];
@@ -2308,7 +2310,8 @@ i_s_cmpmem_fill_low(
}
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&zip_free_mutex);
DBUG_RETURN(status);
}
@@ -2653,3 +2656,299 @@ UNIV_INTERN struct st_mysql_plugin i_s_i
/* void* */
STRUCT_FLD(__reserved1, NULL)
};
+
+/***********************************************************************
+*/
+static ST_FIELD_INFO i_s_innodb_table_stats_info[] =
+{
+ {STRUCT_FLD(field_name, "table_name"),
+ STRUCT_FLD(field_length, NAME_LEN),
+ STRUCT_FLD(field_type, MYSQL_TYPE_STRING),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, 0),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "rows"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "clust_size"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "other_size"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "modified"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ END_OF_ST_FIELD_INFO
+};
+
+static ST_FIELD_INFO i_s_innodb_index_stats_info[] =
+{
+ {STRUCT_FLD(field_name, "table_name"),
+ STRUCT_FLD(field_length, NAME_LEN),
+ STRUCT_FLD(field_type, MYSQL_TYPE_STRING),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, 0),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "index_name"),
+ STRUCT_FLD(field_length, NAME_LEN),
+ STRUCT_FLD(field_type, MYSQL_TYPE_STRING),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, 0),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "fields"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "row_per_keys"),
+ STRUCT_FLD(field_length, 256),
+ STRUCT_FLD(field_type, MYSQL_TYPE_STRING),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, 0),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "index_size"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ {STRUCT_FLD(field_name, "leaf_pages"),
+ STRUCT_FLD(field_length, MY_INT64_NUM_DECIMAL_DIGITS),
+ STRUCT_FLD(field_type, MYSQL_TYPE_LONGLONG),
+ STRUCT_FLD(value, 0),
+ STRUCT_FLD(field_flags, MY_I_S_UNSIGNED),
+ STRUCT_FLD(old_name, ""),
+ STRUCT_FLD(open_method, SKIP_OPEN_TABLE)},
+
+ END_OF_ST_FIELD_INFO
+};
+
+static
+int
+i_s_innodb_table_stats_fill(
+/*========================*/
+ THD* thd,
+ TABLE_LIST* tables,
+ COND* cond)
+{
+ TABLE* i_s_table = (TABLE *) tables->table;
+ int status = 0;
+ dict_table_t* table;
+
+ DBUG_ENTER("i_s_innodb_table_stats_fill");
+
+ /* deny access to non-superusers */
+ if (check_global_access(thd, PROCESS_ACL)) {
+ DBUG_RETURN(0);
+ }
+
+ mutex_enter(&(dict_sys->mutex));
+
+ table = UT_LIST_GET_FIRST(dict_sys->table_LRU);
+
+ while (table) {
+ if (table->stat_clustered_index_size == 0) {
+ table = UT_LIST_GET_NEXT(table_LRU, table);
+ continue;
+ }
+
+ field_store_string(i_s_table->field[0], table->name);
+ i_s_table->field[1]->store(table->stat_n_rows);
+ i_s_table->field[2]->store(table->stat_clustered_index_size);
+ i_s_table->field[3]->store(table->stat_sum_of_other_index_sizes);
+ i_s_table->field[4]->store(table->stat_modified_counter);
+
+ if (schema_table_store_record(thd, i_s_table)) {
+ status = 1;
+ break;
+ }
+
+ table = UT_LIST_GET_NEXT(table_LRU, table);
+ }
+
+ mutex_exit(&(dict_sys->mutex));
+
+ DBUG_RETURN(status);
+}
+
+static
+int
+i_s_innodb_index_stats_fill(
+/*========================*/
+ THD* thd,
+ TABLE_LIST* tables,
+ COND* cond)
+{
+ TABLE* i_s_table = (TABLE *) tables->table;
+ int status = 0;
+ dict_table_t* table;
+ dict_index_t* index;
+
+ DBUG_ENTER("i_s_innodb_index_stats_fill");
+
+ /* deny access to non-superusers */
+ if (check_global_access(thd, PROCESS_ACL)) {
+ DBUG_RETURN(0);
+ }
+
+ mutex_enter(&(dict_sys->mutex));
+
+ table = UT_LIST_GET_FIRST(dict_sys->table_LRU);
+
+ while (table) {
+ if (table->stat_clustered_index_size == 0) {
+ table = UT_LIST_GET_NEXT(table_LRU, table);
+ continue;
+ }
+
+ ib_int64_t n_rows = table->stat_n_rows;
+
+ if (n_rows < 0) {
+ n_rows = 0;
+ }
+
+ index = dict_table_get_first_index(table);
+
+ while (index) {
+ char buff[256+1];
+ char row_per_keys[256+1];
+ ulint i;
+
+ field_store_string(i_s_table->field[0], table->name);
+ field_store_string(i_s_table->field[1], index->name);
+ i_s_table->field[2]->store(index->n_uniq);
+
+ row_per_keys[0] = '\0';
+ if (index->stat_n_diff_key_vals) {
+ for (i = 1; i <= index->n_uniq; i++) {
+ ib_int64_t rec_per_key;
+ if (index->stat_n_diff_key_vals[i]) {
+ rec_per_key = n_rows / index->stat_n_diff_key_vals[i];
+ } else {
+ rec_per_key = n_rows;
+ }
+ snprintf(buff, 256, (i == index->n_uniq)?"%llu":"%llu, ",
+ rec_per_key);
+ strncat(row_per_keys, buff, 256 - strlen(row_per_keys));
+ }
+ }
+ field_store_string(i_s_table->field[3], row_per_keys);
+
+ i_s_table->field[4]->store(index->stat_index_size);
+ i_s_table->field[5]->store(index->stat_n_leaf_pages);
+
+ if (schema_table_store_record(thd, i_s_table)) {
+ status = 1;
+ break;
+ }
+
+ index = dict_table_get_next_index(index);
+ }
+
+ if (status == 1) {
+ break;
+ }
+
+ table = UT_LIST_GET_NEXT(table_LRU, table);
+ }
+
+ mutex_exit(&(dict_sys->mutex));
+
+ DBUG_RETURN(status);
+}
+
+static
+int
+i_s_innodb_table_stats_init(
+/*========================*/
+ void* p)
+{
+ DBUG_ENTER("i_s_innodb_table_stats_init");
+ ST_SCHEMA_TABLE* schema = (ST_SCHEMA_TABLE*) p;
+
+ schema->fields_info = i_s_innodb_table_stats_info;
+ schema->fill_table = i_s_innodb_table_stats_fill;
+
+ DBUG_RETURN(0);
+}
+
+static
+int
+i_s_innodb_index_stats_init(
+/*========================*/
+ void* p)
+{
+ DBUG_ENTER("i_s_innodb_index_stats_init");
+ ST_SCHEMA_TABLE* schema = (ST_SCHEMA_TABLE*) p;
+
+ schema->fields_info = i_s_innodb_index_stats_info;
+ schema->fill_table = i_s_innodb_index_stats_fill;
+
+ DBUG_RETURN(0);
+}
+
+UNIV_INTERN struct st_mysql_plugin i_s_innodb_table_stats =
+{
+ STRUCT_FLD(type, MYSQL_INFORMATION_SCHEMA_PLUGIN),
+ STRUCT_FLD(info, &i_s_info),
+ STRUCT_FLD(name, "INNODB_TABLE_STATS"),
+ STRUCT_FLD(author, plugin_author),
+ STRUCT_FLD(descr, "InnoDB table statistics in memory"),
+ STRUCT_FLD(license, PLUGIN_LICENSE_GPL),
+ STRUCT_FLD(init, i_s_innodb_table_stats_init),
+ STRUCT_FLD(deinit, i_s_common_deinit),
+ STRUCT_FLD(version, 0x0100 /* 1.0 */),
+ STRUCT_FLD(status_vars, NULL),
+ STRUCT_FLD(system_vars, NULL),
+ STRUCT_FLD(__reserved1, NULL)
+};
+
+UNIV_INTERN struct st_mysql_plugin i_s_innodb_index_stats =
+{
+ STRUCT_FLD(type, MYSQL_INFORMATION_SCHEMA_PLUGIN),
+ STRUCT_FLD(info, &i_s_info),
+ STRUCT_FLD(name, "INNODB_INDEX_STATS"),
+ STRUCT_FLD(author, plugin_author),
+ STRUCT_FLD(descr, "InnoDB index statistics in memory"),
+ STRUCT_FLD(license, PLUGIN_LICENSE_GPL),
+ STRUCT_FLD(init, i_s_innodb_index_stats_init),
+ STRUCT_FLD(deinit, i_s_common_deinit),
+ STRUCT_FLD(version, 0x0100 /* 1.0 */),
+ STRUCT_FLD(status_vars, NULL),
+ STRUCT_FLD(system_vars, NULL),
+ STRUCT_FLD(__reserved1, NULL)
+};
=== modified file 'storage/xtradb/handler/i_s.h'
--- a/storage/xtradb/handler/i_s.h 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/handler/i_s.h 2009-06-25 01:43:25 +0000
@@ -37,5 +37,7 @@ extern struct st_mysql_plugin i_s_innodb
extern struct st_mysql_plugin i_s_innodb_cmpmem_reset;
extern struct st_mysql_plugin i_s_innodb_patches;
extern struct st_mysql_plugin i_s_innodb_rseg;
+extern struct st_mysql_plugin i_s_innodb_table_stats;
+extern struct st_mysql_plugin i_s_innodb_index_stats;
#endif /* i_s_h */
=== modified file 'storage/xtradb/handler/innodb_patch_info.h'
--- a/storage/xtradb/handler/innodb_patch_info.h 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/handler/innodb_patch_info.h 2009-07-06 05:47:15 +0000
@@ -31,5 +31,12 @@ struct innodb_enhancement {
{"innodb_expand_undo_slots","expandable maximum number of undo slots","from 1024 (default) to about 4000","http://www.percona.com/docs/wiki/percona-xtradb"},
{"innodb_extra_rseg","allow to create extra rollback segments","When create new db, the new parameter allows to create more rollback segments","http://www.percona.com/docs/wiki/percona-xtradb"},
{"innodb_overwrite_relay_log_info","overwrite relay-log.info when slave recovery","Building as plugin, it is not used.","http://www.percona.com/docs/wiki/percona-xtradb:innodb_overwrite_relay_log_…"},
+{"innodb_pause_in_spin","use 'pause' instruction during spin loop for x86 (gcc)","","http://www.percona.com/docs/wiki/percona-xtradb"},
+{"innodb_thread_concurrency_timer_based","use InnoDB timer based concurrency throttling (backport from MySQL 5.4.0)","",""},
+{"innodb_expand_import","convert .ibd file automatically when import tablespace","the files are generated by xtrabackup export mode.","http://www.percona.com/docs/wiki/percona-xtradb"},
+{"innodb_dict_size_limit","Limit dictionary cache size","Variable innodb_dict_size_limit in bytes","http://www.percona.com/docs/wiki/percona-xtradb"},
+{"innodb_split_buf_pool_mutex","Further fixes for the buffer pool mutex","Splitting buf_pool_mutex and optimizing based on innodb_opt_lru_count","http://www.percona.com/docs/wiki/percona-xtradb"},
+{"innodb_stats","Additional features for InnoDB statistics/optimizer","","http://www.percona.com/docs/wiki/percona-xtradb"},
+{"innodb_recovery_patches","Bugfixes and adjustments to the recovery process","","http://www.percona.com/docs/wiki/percona-xtradb"},
{NULL, NULL, NULL, NULL}
};
=== modified file 'storage/xtradb/ibuf/ibuf0ibuf.c'
--- a/storage/xtradb/ibuf/ibuf0ibuf.c 2009-06-22 08:06:35 +0000
+++ b/storage/xtradb/ibuf/ibuf0ibuf.c 2009-08-03 20:09:53 +0000
@@ -472,6 +472,7 @@ ibuf_init_at_db_start(void)
/* Use old-style record format for the insert buffer. */
table = dict_mem_table_create(IBUF_TABLE_NAME, IBUF_SPACE_ID, 1, 0);
+ table->n_mysql_handles_opened = 1; /* for pin */
dict_mem_table_add_col(table, heap, "DUMMY_COLUMN", DATA_BINARY, 0, 0);
=== modified file 'storage/xtradb/include/buf0buddy.h'
--- a/storage/xtradb/include/buf0buddy.h 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/include/buf0buddy.h 2009-06-25 01:43:25 +0000
@@ -49,10 +49,11 @@ buf_buddy_alloc(
/* out: allocated block,
possibly NULL if lru == NULL */
ulint size, /* in: block size, up to UNIV_PAGE_SIZE */
- ibool* lru) /* in: pointer to a variable that will be assigned
+ ibool* lru, /* in: pointer to a variable that will be assigned
TRUE if storage was allocated from the LRU list
and buf_pool_mutex was temporarily released,
or NULL if the LRU list should not be used */
+ ibool have_page_hash_mutex)
__attribute__((malloc));
/**************************************************************************
@@ -63,7 +64,8 @@ buf_buddy_free(
/*===========*/
void* buf, /* in: block to be freed, must not be
pointed to by the buffer pool */
- ulint size) /* in: block size, up to UNIV_PAGE_SIZE */
+ ulint size, /* in: block size, up to UNIV_PAGE_SIZE */
+ ibool have_page_hash_mutex)
__attribute__((nonnull));
/** Statistics of buddy blocks of a given size. */
=== modified file 'storage/xtradb/include/buf0buddy.ic'
--- a/storage/xtradb/include/buf0buddy.ic 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/include/buf0buddy.ic 2009-06-25 01:43:25 +0000
@@ -44,10 +44,11 @@ buf_buddy_alloc_low(
possibly NULL if lru==NULL */
ulint i, /* in: index of buf_pool->zip_free[],
or BUF_BUDDY_SIZES */
- ibool* lru) /* in: pointer to a variable that will be assigned
+ ibool* lru, /* in: pointer to a variable that will be assigned
TRUE if storage was allocated from the LRU list
and buf_pool_mutex was temporarily released,
or NULL if the LRU list should not be used */
+ ibool have_page_hash_mutex)
__attribute__((malloc));
/**************************************************************************
@@ -58,8 +59,9 @@ buf_buddy_free_low(
/*===============*/
void* buf, /* in: block to be freed, must not be
pointed to by the buffer pool */
- ulint i) /* in: index of buf_pool->zip_free[],
+ ulint i, /* in: index of buf_pool->zip_free[],
or BUF_BUDDY_SIZES */
+ ibool have_page_hash_mutex)
__attribute__((nonnull));
/**************************************************************************
@@ -98,14 +100,15 @@ buf_buddy_alloc(
/* out: allocated block,
possibly NULL if lru == NULL */
ulint size, /* in: block size, up to UNIV_PAGE_SIZE */
- ibool* lru) /* in: pointer to a variable that will be assigned
+ ibool* lru, /* in: pointer to a variable that will be assigned
TRUE if storage was allocated from the LRU list
and buf_pool_mutex was temporarily released,
or NULL if the LRU list should not be used */
+ ibool have_page_hash_mutex)
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
- return(buf_buddy_alloc_low(buf_buddy_get_slot(size), lru));
+ return(buf_buddy_alloc_low(buf_buddy_get_slot(size), lru, have_page_hash_mutex));
}
/**************************************************************************
@@ -116,11 +119,24 @@ buf_buddy_free(
/*===========*/
void* buf, /* in: block to be freed, must not be
pointed to by the buffer pool */
- ulint size) /* in: block size, up to UNIV_PAGE_SIZE */
+ ulint size, /* in: block size, up to UNIV_PAGE_SIZE */
+ ibool have_page_hash_mutex)
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
- buf_buddy_free_low(buf, buf_buddy_get_slot(size));
+ if (!have_page_hash_mutex) {
+ mutex_enter(&LRU_list_mutex);
+ rw_lock_x_lock(&page_hash_latch);
+ }
+
+ mutex_enter(&zip_free_mutex);
+ buf_buddy_free_low(buf, buf_buddy_get_slot(size), TRUE);
+ mutex_exit(&zip_free_mutex);
+
+ if (!have_page_hash_mutex) {
+ mutex_exit(&LRU_list_mutex);
+ rw_lock_x_unlock(&page_hash_latch);
+ }
}
#ifdef UNIV_MATERIALIZE
=== modified file 'storage/xtradb/include/buf0buf.h'
--- a/storage/xtradb/include/buf0buf.h 2009-05-04 04:32:30 +0000
+++ b/storage/xtradb/include/buf0buf.h 2009-06-25 01:43:25 +0000
@@ -1024,7 +1024,7 @@ struct buf_page_struct{
/* 2. Page flushing fields; protected by buf_pool_mutex */
- UT_LIST_NODE_T(buf_page_t) list;
+ /* UT_LIST_NODE_T(buf_page_t) list; */
/* based on state, this is a list
node in one of the following lists
in buf_pool:
@@ -1034,6 +1034,10 @@ struct buf_page_struct{
BUF_BLOCK_ZIP_DIRTY: flush_list
BUF_BLOCK_ZIP_PAGE: zip_clean
BUF_BLOCK_ZIP_FREE: zip_free[] */
+ /* resplit for optimistic use */
+ UT_LIST_NODE_T(buf_page_t) free;
+ UT_LIST_NODE_T(buf_page_t) flush_list;
+ UT_LIST_NODE_T(buf_page_t) zip_list; /* zip_clean or zip_free[] */
#ifdef UNIV_DEBUG
ibool in_flush_list; /* TRUE if in buf_pool->flush_list;
when buf_pool_mutex is free, the
@@ -1104,11 +1108,11 @@ struct buf_block_struct{
a block is in the unzip_LRU list
if page.state == BUF_BLOCK_FILE_PAGE
and page.zip.data != NULL */
-#ifdef UNIV_DEBUG
+//#ifdef UNIV_DEBUG
ibool in_unzip_LRU_list;/* TRUE if the page is in the
decompressed LRU list;
used in debugging */
-#endif /* UNIV_DEBUG */
+//#endif /* UNIV_DEBUG */
byte* frame; /* pointer to buffer frame which
is of size UNIV_PAGE_SIZE, and
aligned to an address divisible by
@@ -1316,6 +1320,12 @@ struct buf_pool_struct{
/* mutex protecting the buffer pool struct and control blocks, except the
read-write lock in them */
extern mutex_t buf_pool_mutex;
+extern mutex_t LRU_list_mutex;
+extern mutex_t flush_list_mutex;
+extern rw_lock_t page_hash_latch;
+extern mutex_t free_list_mutex;
+extern mutex_t zip_free_mutex;
+extern mutex_t zip_hash_mutex;
/* mutex protecting the control blocks of compressed-only pages
(of type buf_page_t, not buf_block_t) */
extern mutex_t buf_pool_zip_mutex;
=== modified file 'storage/xtradb/include/buf0buf.ic'
--- a/storage/xtradb/include/buf0buf.ic 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/include/buf0buf.ic 2009-06-25 01:43:25 +0000
@@ -100,7 +100,9 @@ buf_pool_get_oldest_modification(void)
buf_page_t* bpage;
ib_uint64_t lsn;
- buf_pool_mutex_enter();
+try_again:
+ //buf_pool_mutex_enter();
+ mutex_enter(&flush_list_mutex);
bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
@@ -109,9 +111,14 @@ buf_pool_get_oldest_modification(void)
} else {
ut_ad(bpage->in_flush_list);
lsn = bpage->oldest_modification;
+ if (lsn == 0) {
+ mutex_exit(&flush_list_mutex);
+ goto try_again;
+ }
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(&flush_list_mutex);
/* The returned answer may be out of date: the flush_list can
change after the mutex has been released. */
@@ -128,7 +135,8 @@ buf_pool_clock_tic(void)
/*====================*/
/* out: new clock value */
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
buf_pool->ulint_clock++;
@@ -246,7 +254,7 @@ buf_page_in_file(
case BUF_BLOCK_ZIP_FREE:
/* This is a free page in buf_pool->zip_free[].
Such pages should only be accessed by the buddy allocator. */
- ut_error;
+ /* ut_error; */ /* optimistic */
break;
case BUF_BLOCK_ZIP_PAGE:
case BUF_BLOCK_ZIP_DIRTY:
@@ -288,7 +296,7 @@ buf_page_get_LRU_position(
const buf_page_t* bpage) /* in: control block */
{
ut_ad(buf_page_in_file(bpage));
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own()); /* This is used in optimistic */
return(bpage->LRU_position);
}
@@ -305,7 +313,7 @@ buf_page_get_mutex(
{
switch (buf_page_get_state(bpage)) {
case BUF_BLOCK_ZIP_FREE:
- ut_error;
+ /* ut_error; */ /* optimistic */
return(NULL);
case BUF_BLOCK_ZIP_PAGE:
case BUF_BLOCK_ZIP_DIRTY:
@@ -410,7 +418,7 @@ buf_page_set_io_fix(
buf_page_t* bpage, /* in/out: control block */
enum buf_io_fix io_fix) /* in: io_fix state */
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mutex_own(buf_page_get_mutex(bpage)));
bpage->io_fix = io_fix;
@@ -438,12 +446,13 @@ buf_page_can_relocate(
/*==================*/
const buf_page_t* bpage) /* control block being relocated */
{
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mutex_own(buf_page_get_mutex(bpage)));
ut_ad(buf_page_in_file(bpage));
- ut_ad(bpage->in_LRU_list);
+ /* optimistic */
+ //ut_ad(bpage->in_LRU_list);
- return(buf_page_get_io_fix(bpage) == BUF_IO_NONE
+ return(bpage->in_LRU_list && bpage->io_fix == BUF_IO_NONE
&& bpage->buf_fix_count == 0);
}
@@ -457,7 +466,7 @@ buf_page_is_old(
const buf_page_t* bpage) /* in: control block */
{
ut_ad(buf_page_in_file(bpage));
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own()); /* This is used in optimistic */
return(bpage->old);
}
@@ -472,7 +481,8 @@ buf_page_set_old(
ibool old) /* in: old */
{
ut_a(buf_page_in_file(bpage));
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+ ut_ad(mutex_own(&LRU_list_mutex));
ut_ad(bpage->in_LRU_list);
#ifdef UNIV_LRU_DEBUG
@@ -728,17 +738,17 @@ buf_block_free(
/*===========*/
buf_block_t* block) /* in, own: block to be freed */
{
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
mutex_enter(&block->mutex);
ut_a(buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE);
- buf_LRU_block_free_non_file_page(block);
+ buf_LRU_block_free_non_file_page(block, FALSE);
mutex_exit(&block->mutex);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
}
/*************************************************************************
@@ -783,14 +793,23 @@ buf_page_io_query(
buf_page_t* bpage) /* in: buf_pool block, must be bufferfixed */
{
ibool io_fixed;
+ mutex_t* block_mutex = buf_page_get_mutex(bpage);
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+retry_lock:
+ mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
ut_ad(buf_page_in_file(bpage));
ut_ad(bpage->buf_fix_count > 0);
io_fixed = buf_page_get_io_fix(bpage) != BUF_IO_NONE;
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ mutex_exit(block_mutex);
return(io_fixed);
}
@@ -809,7 +828,13 @@ buf_page_get_newest_modification(
ib_uint64_t lsn;
mutex_t* block_mutex = buf_page_get_mutex(bpage);
+retry_lock:
mutex_enter(block_mutex);
+ if (block_mutex != buf_page_get_mutex(bpage)) {
+ mutex_exit(block_mutex);
+ block_mutex = buf_page_get_mutex(bpage);
+ goto retry_lock;
+ }
if (buf_page_in_file(bpage)) {
lsn = bpage->newest_modification;
@@ -833,7 +858,7 @@ buf_block_modify_clock_inc(
buf_block_t* block) /* in: block */
{
#ifdef UNIV_SYNC_DEBUG
- ut_ad((buf_pool_mutex_own()
+ ut_ad((mutex_own(&LRU_list_mutex)
&& (block->page.buf_fix_count == 0))
|| rw_lock_own(&(block->lock), RW_LOCK_EXCLUSIVE));
#endif /* UNIV_SYNC_DEBUG */
@@ -917,7 +942,11 @@ buf_page_hash_get(
ulint fold;
ut_ad(buf_pool);
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
+#ifdef UNIV_SYNC_DEBUG
+ ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX)
+ || rw_lock_own(&page_hash_latch, RW_LOCK_SHARED));
+#endif
/* Look for the page in the hash table */
@@ -966,11 +995,13 @@ buf_page_peek(
{
const buf_page_t* bpage;
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
+ rw_lock_s_lock(&page_hash_latch);
bpage = buf_page_hash_get(space, offset);
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ rw_lock_s_unlock(&page_hash_latch);
return(bpage != NULL);
}
@@ -1032,11 +1063,14 @@ buf_page_release(
ut_a(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
ut_a(block->page.buf_fix_count > 0);
+ /* buf_flush_note_modification() should be called before this function. */
+/*
if (rw_latch == RW_X_LATCH && mtr->modifications) {
buf_pool_mutex_enter();
buf_flush_note_modification(block, mtr);
buf_pool_mutex_exit();
}
+*/
mutex_enter(&block->mutex);
=== modified file 'storage/xtradb/include/buf0flu.ic'
--- a/storage/xtradb/include/buf0flu.ic 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/include/buf0flu.ic 2009-06-25 01:43:25 +0000
@@ -53,13 +53,23 @@ buf_flush_note_modification(
buf_block_t* block, /* in: block which is modified */
mtr_t* mtr) /* in: mtr */
{
+ ibool use_LRU_mutex = FALSE;
+
+ if (UT_LIST_GET_LEN(buf_pool->unzip_LRU))
+ use_LRU_mutex = TRUE;
+
+ if (use_LRU_mutex)
+ mutex_enter(&LRU_list_mutex);
+
+ mutex_enter(&block->mutex);
+
ut_ad(block);
ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
ut_ad(block->page.buf_fix_count > 0);
#ifdef UNIV_SYNC_DEBUG
ut_ad(rw_lock_own(&(block->lock), RW_LOCK_EX));
#endif /* UNIV_SYNC_DEBUG */
- ut_ad(buf_pool_mutex_own());
+ //ut_ad(buf_pool_mutex_own());
ut_ad(mtr->start_lsn != 0);
ut_ad(mtr->modifications);
@@ -68,16 +78,23 @@ buf_flush_note_modification(
block->page.newest_modification = mtr->end_lsn;
if (!block->page.oldest_modification) {
+ mutex_enter(&flush_list_mutex);
block->page.oldest_modification = mtr->start_lsn;
ut_ad(block->page.oldest_modification != 0);
buf_flush_insert_into_flush_list(block);
+ mutex_exit(&flush_list_mutex);
} else {
ut_ad(block->page.oldest_modification <= mtr->start_lsn);
}
+ mutex_exit(&block->mutex);
+
++srv_buf_pool_write_requests;
+
+ if (use_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
}
/************************************************************************
@@ -92,6 +109,16 @@ buf_flush_recv_note_modification(
ib_uint64_t end_lsn) /* in: end lsn of the last mtr in the
set of mtr's */
{
+ ibool use_LRU_mutex = FALSE;
+
+ if(UT_LIST_GET_LEN(buf_pool->unzip_LRU))
+ use_LRU_mutex = TRUE;
+
+ if (use_LRU_mutex)
+ mutex_enter(&LRU_list_mutex);
+
+ mutex_enter(&(block->mutex));
+
ut_ad(block);
ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
ut_ad(block->page.buf_fix_count > 0);
@@ -99,22 +126,27 @@ buf_flush_recv_note_modification(
ut_ad(rw_lock_own(&(block->lock), RW_LOCK_EX));
#endif /* UNIV_SYNC_DEBUG */
- buf_pool_mutex_enter();
+ //buf_pool_mutex_enter();
ut_ad(block->page.newest_modification <= end_lsn);
block->page.newest_modification = end_lsn;
if (!block->page.oldest_modification) {
+ mutex_enter(&flush_list_mutex);
block->page.oldest_modification = start_lsn;
ut_ad(block->page.oldest_modification != 0);
buf_flush_insert_sorted_into_flush_list(block);
+ mutex_exit(&flush_list_mutex);
} else {
ut_ad(block->page.oldest_modification <= start_lsn);
}
- buf_pool_mutex_exit();
+ //buf_pool_mutex_exit();
+ if (use_LRU_mutex)
+ mutex_exit(&LRU_list_mutex);
+ mutex_exit(&(block->mutex));
}
=== modified file 'storage/xtradb/include/buf0lru.h'
--- a/storage/xtradb/include/buf0lru.h 2009-05-04 02:45:47 +0000
+++ b/storage/xtradb/include/buf0lru.h 2009-06-25 01:43:25 +0000
@@ -122,10 +122,11 @@ buf_LRU_free_block(
buf_page_t* bpage, /* in: block to be freed */
ibool zip, /* in: TRUE if should remove also the
compressed page of an uncompressed page */
- ibool* buf_pool_mutex_released);
+ ibool* buf_pool_mutex_released,
/* in: pointer to a variable that will
be assigned TRUE if buf_pool_mutex
was temporarily released, or NULL */
+ ibool have_LRU_mutex);
/**********************************************************************
Try to free a replaceable block. */
UNIV_INTERN
@@ -169,7 +170,8 @@ UNIV_INTERN
void
buf_LRU_block_free_non_file_page(
/*=============================*/
- buf_block_t* block); /* in: block, must not contain a file page */
+ buf_block_t* block, /* in: block, must not contain a file page */
+ ibool have_page_hash_mutex);
/**********************************************************************
Adds a block to the LRU list. */
UNIV_INTERN
=== modified file 'storage/xtradb/include/dict0dict.h'
--- a/storage/xtradb/include/dict0dict.h 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/dict0dict.h 2009-06-25 01:43:25 +0000
@@ -1102,6 +1102,12 @@ dict_table_get_index_on_name_and_min_id(
/* out: index, NULL if does not exist */
dict_table_t* table, /* in: table */
const char* name); /* in: name of the index to find */
+
+UNIV_INTERN
+void
+dict_table_LRU_trim(
+/*================*/
+ dict_table_t* self);
/* Buffers for storing detailed information about the latest foreign key
and unique key errors */
extern FILE* dict_foreign_err_file;
=== modified file 'storage/xtradb/include/dict0dict.ic'
--- a/storage/xtradb/include/dict0dict.ic 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/dict0dict.ic 2009-06-25 01:43:25 +0000
@@ -723,6 +723,13 @@ dict_table_check_if_in_cache_low(
HASH_SEARCH(name_hash, dict_sys->table_hash, table_fold,
dict_table_t*, table, ut_ad(table->cached),
!strcmp(table->name, table_name));
+
+ /* make young in table_LRU */
+ if (table) {
+ UT_LIST_REMOVE(table_LRU, dict_sys->table_LRU, table);
+ UT_LIST_ADD_FIRST(table_LRU, dict_sys->table_LRU, table);
+ }
+
return(table);
}
@@ -776,6 +783,12 @@ dict_table_get_on_id_low(
table = dict_load_table_on_id(table_id);
}
+ /* make young in table_LRU */
+ if (table) {
+ UT_LIST_REMOVE(table_LRU, dict_sys->table_LRU, table);
+ UT_LIST_ADD_FIRST(table_LRU, dict_sys->table_LRU, table);
+ }
+
ut_ad(!table || table->cached);
/* TODO: should get the type information from MySQL */
=== modified file 'storage/xtradb/include/log0log.h'
--- a/storage/xtradb/include/log0log.h 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/log0log.h 2009-06-25 01:43:25 +0000
@@ -186,6 +186,13 @@ void
log_buffer_flush_to_disk(void);
/*==========================*/
/********************************************************************
+Flushes the log buffer. Forces it to disk depending on the value of
+the configuration parameter innodb_flush_log_at_trx_commit. */
+UNIV_INTERN
+void
+log_buffer_flush_maybe_sync(void);
+/*=============================*/
+/********************************************************************
Advances the smallest lsn for which there are unflushed dirty blocks in the
buffer pool and also may make a new checkpoint. NOTE: this function may only
be called if the calling thread owns no synchronization objects! */
=== modified file 'storage/xtradb/include/rem0cmp.h'
--- a/storage/xtradb/include/rem0cmp.h 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/rem0cmp.h 2009-06-25 01:43:25 +0000
@@ -177,10 +177,11 @@ cmp_rec_rec_with_match(
matched fields; when the function returns,
contains the value the for current
comparison */
- ulint* matched_bytes);/* in/out: number of already matched
+ ulint* matched_bytes, /* in/out: number of already matched
bytes within the first field not completely
matched; when the function returns, contains
the value for the current comparison */
+ ulint stats_method);
/*****************************************************************
This function is used to compare two physical records. Only the common
first fields are compared. */
=== modified file 'storage/xtradb/include/rem0cmp.ic'
--- a/storage/xtradb/include/rem0cmp.ic 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/rem0cmp.ic 2009-06-25 01:43:25 +0000
@@ -88,5 +88,5 @@ cmp_rec_rec(
ulint match_b = 0;
return(cmp_rec_rec_with_match(rec1, rec2, offsets1, offsets2, index,
- &match_f, &match_b));
+ &match_f, &match_b, 0));
}
=== modified file 'storage/xtradb/include/srv0srv.h'
--- a/storage/xtradb/include/srv0srv.h 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/srv0srv.h 2009-06-25 01:43:25 +0000
@@ -127,6 +127,8 @@ extern ulint srv_buf_pool_curr_size; /*
extern ulint srv_mem_pool_size;
extern ulint srv_lock_table_size;
+extern ibool srv_thread_concurrency_timer_based;
+
extern ulint srv_n_file_io_threads;
extern ulint srv_n_read_io_threads;
extern ulint srv_n_write_io_threads;
@@ -163,6 +165,11 @@ extern ulint srv_fast_shutdown; /* If t
extern ibool srv_innodb_status;
extern unsigned long long srv_stats_sample_pages;
+extern ulint srv_stats_method;
+#define SRV_STATS_METHOD_NULLS_EQUAL 0
+#define SRV_STATS_METHOD_NULLS_NOT_EQUAL 1
+#define SRV_STATS_METHOD_IGNORE_NULLS 2
+extern ulint srv_stats_auto_update;
extern ibool srv_use_doublewrite_buf;
extern ibool srv_use_checksums;
@@ -184,8 +191,10 @@ extern ulint srv_enable_unsafe_group_com
extern ulint srv_read_ahead;
extern ulint srv_adaptive_checkpoint;
-extern ulint srv_extra_rsegments;
+extern ulint srv_expand_import;
+extern ulint srv_extra_rsegments;
+extern ulint srv_dict_size_limit;
/*-------------------------------------------*/
extern ulint srv_n_rows_inserted;
@@ -552,6 +561,7 @@ struct export_var_struct{
ulint innodb_data_writes;
ulint innodb_data_written;
ulint innodb_data_reads;
+ ulint innodb_dict_tables;
ulint innodb_buffer_pool_pages_total;
ulint innodb_buffer_pool_pages_data;
ulint innodb_buffer_pool_pages_dirty;
=== modified file 'storage/xtradb/include/sync0sync.h'
--- a/storage/xtradb/include/sync0sync.h 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/include/sync0sync.h 2009-06-25 01:43:25 +0000
@@ -464,8 +464,14 @@ or row lock! */
SYNC_SEARCH_SYS, as memory allocation
can call routines there! Otherwise
the level is SYNC_MEM_HASH. */
+#define SYNC_BUF_LRU_LIST 157
+#define SYNC_BUF_PAGE_HASH 156
+#define SYNC_BUF_BLOCK 155
+#define SYNC_BUF_FREE_LIST 153
+#define SYNC_BUF_ZIP_FREE 152
+#define SYNC_BUF_ZIP_HASH 151
#define SYNC_BUF_POOL 150
-#define SYNC_BUF_BLOCK 149
+#define SYNC_BUF_FLUSH_LIST 149
#define SYNC_DOUBLEWRITE 140
#define SYNC_ANY_LATCH 135
#define SYNC_THR_LOCAL 133
=== modified file 'storage/xtradb/include/univ.i'
--- a/storage/xtradb/include/univ.i 2009-06-18 12:39:21 +0000
+++ b/storage/xtradb/include/univ.i 2009-08-03 20:09:53 +0000
@@ -35,7 +35,7 @@ Created 1/20/1994 Heikki Tuuri
#define INNODB_VERSION_MAJOR 1
#define INNODB_VERSION_MINOR 0
#define INNODB_VERSION_BUGFIX 3
-#define PERCONA_INNODB_VERSION 5a
+#define PERCONA_INNODB_VERSION 6a
/* The following is the InnoDB version as shown in
SELECT plugin_version FROM information_schema.plugins;
=== modified file 'storage/xtradb/include/ut0auxconf.h'
--- a/storage/xtradb/include/ut0auxconf.h 2009-04-27 04:54:14 +0000
+++ b/storage/xtradb/include/ut0auxconf.h 2009-06-25 01:43:25 +0000
@@ -12,3 +12,8 @@ If by any chance Makefile.in and ./confi
the hack from Makefile.in wiped away then the "real" check from plug.in
will take over.
*/
+/* This is a temporary fix for http://bugs.mysql.com/43740 */
+/* force it to be enabled */
+#ifdef HAVE_GCC_ATOMIC_BUILTINS
+#define HAVE_ATOMIC_PTHREAD_T
+#endif
=== modified file 'storage/xtradb/log/log0log.c'
--- a/storage/xtradb/log/log0log.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/log/log0log.c 2009-06-25 01:43:25 +0000
@@ -1526,6 +1526,26 @@ log_buffer_flush_to_disk(void)
}
/********************************************************************
+Flushes the log buffer. Forces it to disk depending on the value of
+the configuration parameter innodb_flush_log_at_trx_commit. */
+UNIV_INTERN
+void
+log_buffer_flush_maybe_sync(void)
+/*=============================*/
+{
+ ib_uint64_t lsn;
+
+ mutex_enter(&(log_sys->mutex));
+
+ lsn = log_sys->lsn;
+
+ mutex_exit(&(log_sys->mutex));
+
+ /* Force log buffer to disk when innodb_flush_log_at_trx_commit = 1. */
+ log_write_up_to(lsn, LOG_WAIT_ALL_GROUPS,
+ srv_flush_log_at_trx_commit == 1 ? TRUE : FALSE);
+}
+/********************************************************************
Tries to establish a big enough margin of free space in the log buffer, such
that a new log entry can be catenated without an immediate need for a flush. */
static
=== modified file 'storage/xtradb/log/log0recv.c'
--- a/storage/xtradb/log/log0recv.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/log/log0recv.c 2009-07-06 05:47:15 +0000
@@ -110,7 +110,7 @@ the log and store the scanned log record
use these free frames to read in pages when we start applying the
log records to the database. */
-UNIV_INTERN ulint recv_n_pool_free_frames = 256;
+UNIV_INTERN ulint recv_n_pool_free_frames = 1024;
/* The maximum lsn we see for a page during the recovery process. If this
is bigger than the lsn we are able to scan up to, that is an indication that
@@ -1225,6 +1225,8 @@ recv_recover_page(
buf_block_get_page_no(block));
if ((recv_addr == NULL)
+ /* bugfix: http://bugs.mysql.com/bug.php?id=44140 */
+ || (recv_addr->state == RECV_BEING_READ && !just_read_in)
|| (recv_addr->state == RECV_BEING_PROCESSED)
|| (recv_addr->state == RECV_PROCESSED)) {
=== modified file 'storage/xtradb/mtr/mtr0mtr.c'
--- a/storage/xtradb/mtr/mtr0mtr.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/mtr/mtr0mtr.c 2009-06-25 01:43:25 +0000
@@ -102,6 +102,38 @@ mtr_memo_pop_all(
}
}
+UNIV_INLINE
+void
+mtr_memo_note_modification_all(
+/*===========================*/
+ mtr_t* mtr) /* in: mtr */
+{
+ mtr_memo_slot_t* slot;
+ dyn_array_t* memo;
+ ulint offset;
+
+ ut_ad(mtr);
+ ut_ad(mtr->magic_n == MTR_MAGIC_N);
+ ut_ad(mtr->state == MTR_COMMITTING); /* Currently only used in
+ commit */
+ ut_ad(mtr->modifications);
+
+ memo = &(mtr->memo);
+
+ offset = dyn_array_get_data_size(memo);
+
+ while (offset > 0) {
+ offset -= sizeof(mtr_memo_slot_t);
+ slot = dyn_array_get_element(memo, offset);
+
+ if (UNIV_LIKELY(slot->object != NULL) &&
+ slot->type == MTR_MEMO_PAGE_X_FIX) {
+ buf_flush_note_modification(
+ (buf_block_t*)slot->object, mtr);
+ }
+ }
+}
+
/****************************************************************
Writes the contents of a mini-transaction log, if any, to the database log. */
static
@@ -180,6 +212,8 @@ mtr_commit(
if (write_log) {
mtr_log_reserve_and_write(mtr);
+
+ mtr_memo_note_modification_all(mtr);
}
/* We first update the modification info to buffer pages, and only
@@ -190,12 +224,13 @@ mtr_commit(
required when we insert modified buffer pages in to the flush list
which must be sorted on oldest_modification. */
- mtr_memo_pop_all(mtr);
-
if (write_log) {
log_release();
}
+ /* All unlocking has been moved here, after log_sys mutex release. */
+ mtr_memo_pop_all(mtr);
+
ut_d(mtr->state = MTR_COMMITTED);
dyn_array_free(&(mtr->memo));
dyn_array_free(&(mtr->log));
@@ -263,6 +298,12 @@ mtr_memo_release(
slot = dyn_array_get_element(memo, offset);
if ((object == slot->object) && (type == slot->type)) {
+ if (mtr->modifications &&
+ UNIV_LIKELY(slot->object != NULL) &&
+ slot->type == MTR_MEMO_PAGE_X_FIX) {
+ buf_flush_note_modification(
+ (buf_block_t*)slot->object, mtr);
+ }
mtr_memo_slot_release(mtr, slot);
=== modified file 'storage/xtradb/os/os0file.c'
--- a/storage/xtradb/os/os0file.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/os/os0file.c 2009-06-25 01:43:25 +0000
@@ -73,6 +73,28 @@ UNIV_INTERN ibool os_aio_use_native_aio
UNIV_INTERN ibool os_aio_print_debug = FALSE;
+/* State of an IO request in simulated AIO.
+ Protocol for simulated aio:
+ client requests IO: find slot with reserved = FALSE. Add entry with
+ status = OS_AIO_NOT_ISSUED.
+ IO thread wakes: find adjacent slots with reserved = TRUE and status =
+ OS_AIO_NOT_ISSUED. Change status for slots to
+ OS_AIO_ISSUED.
+ IO operation completes: set status for slots to OS_AIO_DONE. set status
+ for the first slot to OS_AIO_CLAIMED and return
+ result for that slot.
+ When there are multiple read and write threads, they all compete to execute
+ the requests in the array (os_aio_array_t). This avoids the need to load
+ balance requests at the time the request is made at the cost of waking all
+ threads when a request is available.
+*/
+typedef enum {
+ OS_AIO_NOT_ISSUED, /* Available to be processed by an IO thread. */
+ OS_AIO_ISSUED, /* Being processed by an IO thread. */
+ OS_AIO_DONE, /* Request processed. */
+ OS_AIO_CLAIMED /* Result being returned to client. */
+} os_aio_status;
+
/* The aio array slot structure */
typedef struct os_aio_slot_struct os_aio_slot_t;
@@ -81,6 +103,8 @@ struct os_aio_slot_struct{
ulint pos; /* index of the slot in the aio
array */
ibool reserved; /* TRUE if this slot is reserved */
+ os_aio_status status; /* Status for current request. Valid when reserved
+ is TRUE. Used only in simulated aio. */
time_t reservation_time;/* time when reserved */
ulint len; /* length of the block to read or
write */
@@ -91,11 +115,11 @@ struct os_aio_slot_struct{
ulint offset_high; /* 32 high bits of file offset */
os_file_t file; /* file where to read or write */
const char* name; /* file name or path */
- ibool io_already_done;/* used only in simulated aio:
- TRUE if the physical i/o already
- made and only the slot message
- needs to be passed to the caller
- of os_aio_simulated_handle */
+// ibool io_already_done;/* used only in simulated aio:
+// TRUE if the physical i/o already
+// made and only the slot message
+// needs to be passed to the caller
+// of os_aio_simulated_handle */
fil_node_t* message1; /* message which is given by the */
void* message2; /* the requester of an aio operation
and which can be used to identify
@@ -141,6 +165,13 @@ struct os_aio_array_struct{
/* Array of events used in simulated aio */
static os_event_t* os_aio_segment_wait_events = NULL;
+/* Number for the first global segment for reading. */
+const ulint os_aio_first_read_segment = 2;
+
+/* Number for the first global segment for writing. Set to
+2 + os_aio_read_write_threads. */
+ulint os_aio_first_write_segment = 0;
+
/* The aio arrays for non-ibuf i/o and ibuf i/o, as well as sync aio. These
are NULL when the module has not yet been initialized. */
static os_aio_array_t* os_aio_read_array = NULL;
@@ -149,11 +180,17 @@ static os_aio_array_t* os_aio_ibuf_array
static os_aio_array_t* os_aio_log_array = NULL;
static os_aio_array_t* os_aio_sync_array = NULL;
+/* Per thread buffer used for merged IO requests. Used by
+os_aio_simulated_handle so that a buffer doesn't have to be allocated
+for each request. */
+static char* os_aio_thread_buffer[SRV_MAX_N_IO_THREADS];
+static ulint os_aio_thread_buffer_size[SRV_MAX_N_IO_THREADS];
+
static ulint os_aio_n_segments = ULINT_UNDEFINED;
/* If the following is TRUE, read i/o handler threads try to
wait until a batch of new read requests have been posted */
-static ibool os_aio_recommend_sleep_for_read_threads = FALSE;
+static volatile ibool os_aio_recommend_sleep_for_read_threads = FALSE;
UNIV_INTERN ulint os_n_file_reads = 0;
UNIV_INTERN ulint os_bytes_read_since_printout = 0;
@@ -2956,6 +2993,8 @@ os_aio_init(
for (i = 0; i < n_segments; i++) {
srv_set_io_thread_op_info(i, "not started yet");
+ os_aio_thread_buffer[i] = 0;
+ os_aio_thread_buffer_size[i] = 0;
}
n_per_seg = n / n_segments;
@@ -2964,6 +3003,7 @@ os_aio_init(
/* fprintf(stderr, "Array n per seg %lu\n", n_per_seg); */
+ os_aio_first_write_segment = os_aio_first_read_segment + n_read_threads;
os_aio_ibuf_array = os_aio_array_create(n_per_seg, 1);
srv_io_thread_function[0] = "insert buffer thread";
@@ -2972,14 +3012,14 @@ os_aio_init(
srv_io_thread_function[1] = "log thread";
- os_aio_read_array = os_aio_array_create(n_read_segs * n_per_seg,
+ os_aio_read_array = os_aio_array_create(n_per_seg,
n_read_segs);
for (i = 2; i < 2 + n_read_segs; i++) {
ut_a(i < SRV_MAX_N_IO_THREADS);
srv_io_thread_function[i] = "read thread";
}
- os_aio_write_array = os_aio_array_create(n_write_segs * n_per_seg,
+ os_aio_write_array = os_aio_array_create(n_per_seg,
n_write_segs);
for (i = 2 + n_read_segs; i < n_segments; i++) {
ut_a(i < SRV_MAX_N_IO_THREADS);
@@ -3225,7 +3265,8 @@ loop:
slot->buf = buf;
slot->offset = offset;
slot->offset_high = offset_high;
- slot->io_already_done = FALSE;
+// slot->io_already_done = FALSE;
+ slot->status = OS_AIO_NOT_ISSUED;
#ifdef WIN_ASYNC_IO
control = &(slot->control);
@@ -3256,6 +3297,7 @@ os_aio_array_free_slot(
ut_ad(slot->reserved);
slot->reserved = FALSE;
+ slot->status = OS_AIO_NOT_ISSUED;
array->n_reserved--;
@@ -3292,16 +3334,18 @@ os_aio_simulated_wake_handler_thread(
segment = os_aio_get_array_and_local_segment(&array, global_segment);
- n = array->n_slots / array->n_segments;
+ n = array->n_slots;
/* Look through n slots after the segment * n'th slot */
os_mutex_enter(array->mutex);
for (i = 0; i < n; i++) {
- slot = os_aio_array_get_nth_slot(array, i + segment * n);
+ slot = os_aio_array_get_nth_slot(array, i);
- if (slot->reserved) {
+ if (slot->reserved &&
+ (slot->status == OS_AIO_NOT_ISSUED ||
+ slot->status == OS_AIO_DONE)) {
/* Found an i/o request */
break;
@@ -3311,7 +3355,25 @@ os_aio_simulated_wake_handler_thread(
os_mutex_exit(array->mutex);
if (i < n) {
- os_event_set(os_aio_segment_wait_events[global_segment]);
+ if (array == os_aio_ibuf_array) {
+ os_event_set(os_aio_segment_wait_events[0]);
+
+ } else if (array == os_aio_log_array) {
+ os_event_set(os_aio_segment_wait_events[1]);
+
+ } else if (array == os_aio_read_array) {
+ ulint x;
+ for (x = os_aio_first_read_segment; x < os_aio_first_write_segment; x++)
+ os_event_set(os_aio_segment_wait_events[x]);
+
+ } else if (array == os_aio_write_array) {
+ ulint x;
+ for (x = os_aio_first_write_segment; x < os_aio_n_segments; x++)
+ os_event_set(os_aio_segment_wait_events[x]);
+
+ } else {
+ ut_a(0);
+ }
}
}
@@ -3322,8 +3384,6 @@ void
os_aio_simulated_wake_handler_threads(void)
/*=======================================*/
{
- ulint i;
-
if (os_aio_use_native_aio) {
/* We do not use simulated aio: do nothing */
@@ -3332,9 +3392,10 @@ os_aio_simulated_wake_handler_threads(vo
os_aio_recommend_sleep_for_read_threads = FALSE;
- for (i = 0; i < os_aio_n_segments; i++) {
- os_aio_simulated_wake_handler_thread(i);
- }
+ os_aio_simulated_wake_handler_thread(0);
+ os_aio_simulated_wake_handler_thread(1);
+ os_aio_simulated_wake_handler_thread(os_aio_first_read_segment);
+ os_aio_simulated_wake_handler_thread(os_aio_first_write_segment);
}
/**************************************************************************
@@ -3606,7 +3667,7 @@ os_aio_windows_handle(
ut_ad(os_aio_validate());
ut_ad(segment < array->n_segments);
- n = array->n_slots / array->n_segments;
+ n = array->n_slots;
if (array == os_aio_sync_array) {
os_event_wait(os_aio_array_get_nth_slot(array, pos)->event);
@@ -3615,12 +3676,12 @@ os_aio_windows_handle(
srv_set_io_thread_op_info(orig_seg, "wait Windows aio");
i = os_event_wait_multiple(n,
(array->native_events)
- + segment * n);
+ );
}
os_mutex_enter(array->mutex);
- slot = os_aio_array_get_nth_slot(array, i + segment * n);
+ slot = os_aio_array_get_nth_slot(array, i);
ut_a(slot->reserved);
@@ -3685,10 +3746,13 @@ os_aio_simulated_handle(
os_aio_slot_t* slot;
os_aio_slot_t* slot2;
os_aio_slot_t* consecutive_ios[OS_AIO_MERGE_N_CONSECUTIVE];
+ os_aio_slot_t* lowest_request;
+ os_aio_slot_t* oldest_request;
ulint n_consecutive;
ulint total_len;
ulint offs;
ulint lowest_offset;
+ ulint oldest_offset;
ulint biggest_age;
ulint age;
byte* combined_buf;
@@ -3696,6 +3760,7 @@ os_aio_simulated_handle(
ibool ret;
ulint n;
ulint i;
+ time_t now;
segment = os_aio_get_array_and_local_segment(&array, global_segment);
@@ -3708,7 +3773,7 @@ restart:
ut_ad(os_aio_validate());
ut_ad(segment < array->n_segments);
- n = array->n_slots / array->n_segments;
+ n = array->n_slots;
/* Look through n slots after the segment * n'th slot */
@@ -3730,9 +3795,9 @@ restart:
done */
for (i = 0; i < n; i++) {
- slot = os_aio_array_get_nth_slot(array, i + segment * n);
+ slot = os_aio_array_get_nth_slot(array, i);
- if (slot->reserved && slot->io_already_done) {
+ if (slot->reserved && slot->status == OS_AIO_DONE) {
if (os_aio_print_debug) {
fprintf(stderr,
@@ -3754,67 +3819,57 @@ restart:
then pick the one at the lowest offset. */
biggest_age = 0;
- lowest_offset = ULINT_MAX;
+ now = time(NULL);
+ oldest_request = lowest_request = NULL;
+ oldest_offset = lowest_offset = ULINT_MAX;
+ /* Find the oldest request and the request with the smallest offset */
for (i = 0; i < n; i++) {
- slot = os_aio_array_get_nth_slot(array, i + segment * n);
+ slot = os_aio_array_get_nth_slot(array, i);
- if (slot->reserved) {
- age = (ulint)difftime(time(NULL),
- slot->reservation_time);
+ if (slot->reserved && slot->status == OS_AIO_NOT_ISSUED) {
+ age = (ulint)difftime(now, slot->reservation_time);
if ((age >= 2 && age > biggest_age)
|| (age >= 2 && age == biggest_age
- && slot->offset < lowest_offset)) {
+ && slot->offset < oldest_offset)) {
/* Found an i/o request */
- consecutive_ios[0] = slot;
-
- n_consecutive = 1;
-
biggest_age = age;
- lowest_offset = slot->offset;
+ oldest_request = slot;
+ oldest_offset = slot->offset;
}
- }
- }
-
- if (n_consecutive == 0) {
- /* There were no old requests. Look for an i/o request at the
- lowest offset in the array (we ignore the high 32 bits of the
- offset in these heuristics) */
-
- lowest_offset = ULINT_MAX;
-
- for (i = 0; i < n; i++) {
- slot = os_aio_array_get_nth_slot(array,
- i + segment * n);
-
- if (slot->reserved && slot->offset < lowest_offset) {
+ /* Look for an i/o request at the lowest offset in the array
+ * (we ignore the high 32 bits of the offset) */
+ if (slot->offset < lowest_offset) {
/* Found an i/o request */
- consecutive_ios[0] = slot;
-
- n_consecutive = 1;
-
+ lowest_request = slot;
lowest_offset = slot->offset;
}
}
}
- if (n_consecutive == 0) {
+ if (!lowest_request && !oldest_request) {
/* No i/o requested at the moment */
goto wait_for_io;
}
- slot = consecutive_ios[0];
+ if (oldest_request) {
+ slot = oldest_request;
+ } else {
+ slot = lowest_request;
+ }
+ consecutive_ios[0] = slot;
+ n_consecutive = 1;
/* Check if there are several consecutive blocks to read or write */
consecutive_loop:
for (i = 0; i < n; i++) {
- slot2 = os_aio_array_get_nth_slot(array, i + segment * n);
+ slot2 = os_aio_array_get_nth_slot(array, i);
if (slot2->reserved && slot2 != slot
&& slot2->offset == slot->offset + slot->len
@@ -3822,7 +3877,8 @@ consecutive_loop:
&& slot->offset + slot->len > slot->offset
&& slot2->offset_high == slot->offset_high
&& slot2->type == slot->type
- && slot2->file == slot->file) {
+ && slot2->file == slot->file
+ && slot2->status == OS_AIO_NOT_ISSUED) {
/* Found a consecutive i/o request */
@@ -3851,6 +3907,8 @@ consecutive_loop:
for (i = 0; i < n_consecutive; i++) {
total_len += consecutive_ios[i]->len;
+ ut_a(consecutive_ios[i]->status == OS_AIO_NOT_ISSUED);
+ consecutive_ios[i]->status = OS_AIO_ISSUED;
}
if (n_consecutive == 1) {
@@ -3858,7 +3916,14 @@ consecutive_loop:
combined_buf = slot->buf;
combined_buf2 = NULL;
} else {
- combined_buf2 = ut_malloc(total_len + UNIV_PAGE_SIZE);
+ if ((total_len + UNIV_PAGE_SIZE) > os_aio_thread_buffer_size[global_segment]) {
+ if (os_aio_thread_buffer[global_segment])
+ ut_free(os_aio_thread_buffer[global_segment]);
+
+ os_aio_thread_buffer[global_segment] = ut_malloc(total_len + UNIV_PAGE_SIZE);
+ os_aio_thread_buffer_size[global_segment] = total_len + UNIV_PAGE_SIZE;
+ }
+ combined_buf2 = os_aio_thread_buffer[global_segment];
ut_a(combined_buf2);
@@ -3869,6 +3934,9 @@ consecutive_loop:
this assumes that there is just one i/o-handler thread serving
a single segment of slots! */
+ ut_a(slot->reserved);
+ ut_a(slot->status == OS_AIO_ISSUED);
+
os_mutex_exit(array->mutex);
if (slot->type == OS_FILE_WRITE && n_consecutive > 1) {
@@ -3924,16 +3992,13 @@ consecutive_loop:
}
}
- if (combined_buf2) {
- ut_free(combined_buf2);
- }
-
os_mutex_enter(array->mutex);
/* Mark the i/os done in slots */
for (i = 0; i < n_consecutive; i++) {
- consecutive_ios[i]->io_already_done = TRUE;
+ ut_a(consecutive_ios[i]->status == OS_AIO_ISSUED);
+ consecutive_ios[i]->status = OS_AIO_DONE;
}
/* We return the messages for the first slot now, and if there were
@@ -3943,6 +4008,8 @@ consecutive_loop:
slot_io_done:
ut_a(slot->reserved);
+ ut_a(slot->status == OS_AIO_DONE);
+ slot->status = OS_AIO_CLAIMED;
*message1 = slot->message1;
*message2 = slot->message2;
=== modified file 'storage/xtradb/rem/rem0cmp.c'
--- a/storage/xtradb/rem/rem0cmp.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/rem/rem0cmp.c 2009-06-25 01:43:25 +0000
@@ -892,10 +892,11 @@ cmp_rec_rec_with_match(
matched fields; when the function returns,
contains the value the for current
comparison */
- ulint* matched_bytes) /* in/out: number of already matched
+ ulint* matched_bytes, /* in/out: number of already matched
bytes within the first field not completely
matched; when the function returns, contains
the value for the current comparison */
+ ulint stats_method)
{
#ifndef UNIV_HOTBACKUP
ulint rec1_n_fields; /* the number of fields in rec */
@@ -989,7 +990,11 @@ cmp_rec_rec_with_match(
if (rec1_f_len == rec2_f_len) {
- goto next_field;
+ if (stats_method == SRV_STATS_METHOD_NULLS_EQUAL) {
+ goto next_field;
+ } else {
+ ret = -1;
+ }
} else if (rec2_f_len == UNIV_SQL_NULL) {
=== modified file 'storage/xtradb/row/row0mysql.c'
--- a/storage/xtradb/row/row0mysql.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/row/row0mysql.c 2009-06-25 01:43:25 +0000
@@ -854,6 +854,9 @@ row_update_statistics_if_needed(
table->stat_modified_counter = counter + 1;
+ if (!srv_stats_auto_update)
+ return;
+
/* Calculate new statistics if 1 / 16 of table has been modified
since the last time a statistics batch was run, or if
stat_modified_counter > 2 000 000 000 (to avoid wrap-around).
=== modified file 'storage/xtradb/scripts/install_innodb_plugins.sql'
--- a/storage/xtradb/scripts/install_innodb_plugins.sql 2009-01-29 16:54:13 +0000
+++ b/storage/xtradb/scripts/install_innodb_plugins.sql 2009-06-25 01:43:25 +0000
@@ -12,3 +12,5 @@ INSTALL PLUGIN INNODB_BUFFER_POOL_PAGES
INSTALL PLUGIN INNODB_BUFFER_POOL_PAGES_BLOB SONAME 'ha_innodb.so';
INSTALL PLUGIN INNODB_BUFFER_POOL_PAGES_INDEX SONAME 'ha_innodb.so';
INSTALL PLUGIN innodb_rseg SONAME 'ha_innodb.so';
+INSTALL PLUGIN innodb_table_stats SONAME 'ha_innodb.so';
+INSTALL PLUGIN innodb_index_stats SONAME 'ha_innodb.so';
=== modified file 'storage/xtradb/srv/srv0srv.c'
--- a/storage/xtradb/srv/srv0srv.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/srv/srv0srv.c 2009-07-06 05:47:15 +0000
@@ -285,6 +285,7 @@ Value 10 should be good if there are les
computer. Bigger computers need bigger values. Value 0 will disable the
concurrency check. */
+UNIV_INTERN ibool srv_thread_concurrency_timer_based = FALSE;
UNIV_INTERN ulong srv_thread_concurrency = 0;
UNIV_INTERN ulong srv_commit_concurrency = 0;
@@ -336,6 +337,8 @@ UNIV_INTERN ibool srv_innodb_status = FA
/* When estimating number of different key values in an index, sample
this many index pages */
UNIV_INTERN unsigned long long srv_stats_sample_pages = 8;
+UNIV_INTERN ulint srv_stats_method = 0;
+UNIV_INTERN ulint srv_stats_auto_update = 1;
UNIV_INTERN ibool srv_use_doublewrite_buf = TRUE;
UNIV_INTERN ibool srv_use_checksums = TRUE;
@@ -361,14 +364,18 @@ UNIV_INTERN ulint srv_flush_neighbor_pag
UNIV_INTERN ulint srv_enable_unsafe_group_commit = 0; /* 0:disable 1:enable */
UNIV_INTERN ulint srv_read_ahead = 3; /* 1: random 2: linear 3: Both */
-UNIV_INTERN ulint srv_adaptive_checkpoint = 0; /* 0:disable 1:enable */
+UNIV_INTERN ulint srv_adaptive_checkpoint = 0; /* 0: none 1: reflex 2: estimate */
+
+UNIV_INTERN ulint srv_expand_import = 0; /* 0:disable 1:enable */
UNIV_INTERN ulint srv_extra_rsegments = 0; /* extra rseg for users */
+UNIV_INTERN ulint srv_dict_size_limit = 0;
/*-------------------------------------------*/
UNIV_INTERN ulong srv_n_spin_wait_rounds = 20;
UNIV_INTERN ulong srv_n_free_tickets_to_enter = 500;
UNIV_INTERN ulong srv_thread_sleep_delay = 10000;
UNIV_INTERN ulint srv_spin_wait_delay = 5;
+UNIV_INTERN ulint srv_spins_microsec = 50;
UNIV_INTERN ibool srv_priority_boost = TRUE;
#ifdef UNIV_DEBUG
@@ -657,6 +664,47 @@ are indexed by the type of the thread. *
UNIV_INTERN ulint srv_n_threads_active[SRV_MASTER + 1];
UNIV_INTERN ulint srv_n_threads[SRV_MASTER + 1];
+static
+void
+srv_align_spins_microsec(void)
+{
+ ulint start_sec, end_sec;
+ ulint start_usec, end_usec;
+ ib_uint64_t usecs;
+
+ /* change temporary */
+ srv_spins_microsec = 1;
+
+ if (ut_usectime(&start_sec, &start_usec)) {
+ srv_spins_microsec = 50;
+ goto end;
+ }
+
+ ut_delay(100000);
+
+ if (ut_usectime(&end_sec, &end_usec)) {
+ srv_spins_microsec = 50;
+ goto end;
+ }
+
+ usecs = (end_sec - start_sec) * 1000000LL + (end_usec - start_usec);
+
+ if (usecs) {
+ srv_spins_microsec = 100000 / usecs;
+ if (srv_spins_microsec == 0)
+ srv_spins_microsec = 1;
+ if (srv_spins_microsec > 50)
+ srv_spins_microsec = 50;
+ } else {
+ srv_spins_microsec = 50;
+ }
+end:
+ if (srv_spins_microsec != 50)
+ fprintf(stderr,
+ "InnoDB: unit of spin count at ut_delay() is aligned to %lu\n",
+ srv_spins_microsec);
+}
+
/*************************************************************************
Sets the info describing an i/o thread current state. */
UNIV_INTERN
@@ -889,6 +937,8 @@ srv_init(void)
dict_table_t* table;
ulint i;
+ srv_align_spins_microsec();
+
srv_sys = mem_alloc(sizeof(srv_sys_t));
kernel_mutex_temp = mem_alloc(sizeof(mutex_t));
@@ -1009,6 +1059,75 @@ UNIV_INTERN ulong srv_max_purge_lag = 0
/*************************************************************************
Puts an OS thread to wait if there are too many concurrent threads
(>= srv_thread_concurrency) inside InnoDB. The threads wait in a FIFO queue. */
+
+#ifdef INNODB_RW_LOCKS_USE_ATOMICS
+static void
+enter_innodb_with_tickets(trx_t* trx)
+{
+ trx->declared_to_be_inside_innodb = TRUE;
+ trx->n_tickets_to_enter_innodb = SRV_FREE_TICKETS_TO_ENTER;
+ return;
+}
+
+static void
+srv_conc_enter_innodb_timer_based(trx_t* trx)
+{
+ lint conc_n_threads;
+ ibool has_yielded = FALSE;
+ ulint has_slept = 0;
+
+ if (trx->declared_to_be_inside_innodb) {
+ ut_print_timestamp(stderr);
+ fputs(
+" InnoDB: Error: trying to declare trx to enter InnoDB, but\n"
+"InnoDB: it already is declared.\n", stderr);
+ trx_print(stderr, trx, 0);
+ putc('\n', stderr);
+ }
+retry:
+ if (srv_conc_n_threads < (lint) srv_thread_concurrency) {
+ conc_n_threads = __sync_add_and_fetch(&srv_conc_n_threads, 1);
+ if (conc_n_threads <= (lint) srv_thread_concurrency) {
+ enter_innodb_with_tickets(trx);
+ return;
+ }
+ __sync_add_and_fetch(&srv_conc_n_threads, -1);
+ }
+ if (!has_yielded)
+ {
+ has_yielded = TRUE;
+ os_thread_yield();
+ goto retry;
+ }
+ if (trx->has_search_latch
+ || NULL != UT_LIST_GET_FIRST(trx->trx_locks)) {
+
+ conc_n_threads = __sync_add_and_fetch(&srv_conc_n_threads, 1);
+ enter_innodb_with_tickets(trx);
+ return;
+ }
+ if (has_slept < 2)
+ {
+ trx->op_info = "sleeping before entering InnoDB";
+ os_thread_sleep(10000);
+ trx->op_info = "";
+ has_slept++;
+ }
+ conc_n_threads = __sync_add_and_fetch(&srv_conc_n_threads, 1);
+ enter_innodb_with_tickets(trx);
+ return;
+}
+
+static void
+srv_conc_exit_innodb_timer_based(trx_t* trx)
+{
+ __sync_add_and_fetch(&srv_conc_n_threads, -1);
+ trx->declared_to_be_inside_innodb = FALSE;
+ trx->n_tickets_to_enter_innodb = 0;
+ return;
+}
+#endif
+
UNIV_INTERN
void
srv_conc_enter_innodb(
@@ -1039,6 +1158,13 @@ srv_conc_enter_innodb(
return;
}
+#ifdef INNODB_RW_LOCKS_USE_ATOMICS
+ if (srv_thread_concurrency_timer_based) {
+ srv_conc_enter_innodb_timer_based(trx);
+ return;
+ }
+#endif
+
os_fast_mutex_lock(&srv_conc_mutex);
retry:
if (trx->declared_to_be_inside_innodb) {
@@ -1182,6 +1308,14 @@ srv_conc_force_enter_innodb(
}
ut_ad(srv_conc_n_threads >= 0);
+#ifdef INNODB_RW_LOCKS_USE_ATOMICS
+ if (srv_thread_concurrency_timer_based) {
+ __sync_add_and_fetch(&srv_conc_n_threads, 1);
+ trx->declared_to_be_inside_innodb = TRUE;
+ trx->n_tickets_to_enter_innodb = 1;
+ return;
+ }
+#endif
os_fast_mutex_lock(&srv_conc_mutex);
@@ -1215,6 +1349,13 @@ srv_conc_force_exit_innodb(
return;
}
+#ifdef INNODB_RW_LOCKS_USE_ATOMICS
+ if (srv_thread_concurrency_timer_based) {
+ srv_conc_exit_innodb_timer_based(trx);
+ return;
+ }
+#endif
+
os_fast_mutex_lock(&srv_conc_mutex);
ut_ad(srv_conc_n_threads > 0);
@@ -1934,6 +2075,7 @@ srv_export_innodb_status(void)
export_vars.innodb_data_reads = os_n_file_reads;
export_vars.innodb_data_writes = os_n_file_writes;
export_vars.innodb_data_written = srv_data_written;
+ export_vars.innodb_dict_tables= (dict_sys ? UT_LIST_GET_LEN(dict_sys->table_LRU) : 0);
export_vars.innodb_buffer_pool_read_requests = buf_pool->n_page_gets;
export_vars.innodb_buffer_pool_write_requests
= srv_buf_pool_write_requests;
@@ -2348,6 +2490,8 @@ srv_master_thread(
ibool skip_sleep = FALSE;
ulint i;
+ ib_uint64_t lsn_old;
+
ib_uint64_t oldest_lsn;
#ifdef UNIV_DEBUG_THREAD_CREATION
@@ -2365,6 +2509,9 @@ srv_master_thread(
mutex_exit(&kernel_mutex);
+ mutex_enter(&(log_sys->mutex));
+ lsn_old = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
loop:
/*****************************************************************/
/* ---- When there is database activity by users, we cycle in this
@@ -2399,6 +2546,19 @@ loop:
if (!skip_sleep) {
os_thread_sleep(1000000);
+
+ /*
+ mutex_enter(&(log_sys->mutex));
+ oldest_lsn = buf_pool_get_oldest_modification();
+ ib_uint64_t lsn = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
+
+ if(oldest_lsn)
+ fprintf(stderr,
+ "InnoDB flush: age pct: %lu, lsn progress: %lu\n",
+ (lsn - oldest_lsn) * 100 / log_sys->max_checkpoint_age,
+ lsn - lsn_old);
+ */
}
skip_sleep = FALSE;
@@ -2437,14 +2597,15 @@ loop:
+ log_sys->n_pending_writes;
n_ios = log_sys->n_log_ios + buf_pool->n_pages_read
+ buf_pool->n_pages_written;
- if (n_pend_ios < 3 && (n_ios - n_ios_old < PCT_IO(5))) {
+ if (n_pend_ios < PCT_IO(3) && (n_ios - n_ios_old < PCT_IO(5))) {
srv_main_thread_op_info = "doing insert buffer merge";
ibuf_contract_for_n_pages(
TRUE, PCT_IBUF_IO((srv_insert_buffer_batch_size / 4)));
srv_main_thread_op_info = "flushing log";
- log_buffer_flush_to_disk();
+ /* No fsync when srv_flush_log_at_trx_commit != 1 */
+ log_buffer_flush_maybe_sync();
}
if (UNIV_UNLIKELY(buf_get_modified_ratio_pct()
@@ -2462,13 +2623,16 @@ loop:
iteration of this loop. */
skip_sleep = TRUE;
- } else if (srv_adaptive_checkpoint) {
+ mutex_enter(&(log_sys->mutex));
+ lsn_old = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
+ } else if (srv_adaptive_checkpoint == 1) {
/* Try to keep modified age not to exceed
max_checkpoint_age * 7/8 line */
mutex_enter(&(log_sys->mutex));
-
+ lsn_old = log_sys->lsn;
oldest_lsn = buf_pool_get_oldest_modification();
if (oldest_lsn == 0) {
@@ -2504,7 +2668,93 @@ loop:
mutex_exit(&(log_sys->mutex));
}
}
+ } else if (srv_adaptive_checkpoint == 2) {
+ /* Try to keep modified age not to exceed
+ max_checkpoint_age * 7/8 line */
+
+ mutex_enter(&(log_sys->mutex));
+
+ oldest_lsn = buf_pool_get_oldest_modification();
+ if (oldest_lsn == 0) {
+ lsn_old = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
+
+ } else {
+ if ((log_sys->lsn - oldest_lsn)
+ > (log_sys->max_checkpoint_age) - ((log_sys->max_checkpoint_age) / 8)) {
+ /* LOG_POOL_PREFLUSH_RATIO_ASYNC is exceeded. */
+ /* We should not flush from here. */
+ lsn_old = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
+ } else if ((log_sys->lsn - oldest_lsn)
+ > (log_sys->max_checkpoint_age)/2 ) {
+
+ /* defence line (max_checkpoint_age * 1/2) */
+ ib_uint64_t lsn = log_sys->lsn;
+
+ mutex_exit(&(log_sys->mutex));
+
+ ib_uint64_t level, bpl;
+ buf_page_t* bpage;
+
+ mutex_enter(&flush_list_mutex);
+
+ level = 0;
+ bpage = UT_LIST_GET_FIRST(buf_pool->flush_list);
+
+ while (bpage != NULL) {
+ ib_uint64_t oldest_modification = bpage->oldest_modification;
+ if (oldest_modification != 0) {
+ level += log_sys->max_checkpoint_age
+ - (lsn - oldest_modification);
+ }
+ bpage = UT_LIST_GET_NEXT(flush_list, bpage);
+ }
+
+ if (level) {
+ bpl = ((ib_uint64_t) UT_LIST_GET_LEN(buf_pool->flush_list)
+ * UT_LIST_GET_LEN(buf_pool->flush_list)
+ * (lsn - lsn_old)) / level;
+ } else {
+ bpl = 0;
+ }
+
+ mutex_exit(&flush_list_mutex);
+
+ if (!srv_use_doublewrite_buf) {
+ /* flush is faster than when doublewrite */
+ bpl = (bpl * 3) / 4;
+ }
+
+ if (bpl) {
+retry_flush_batch:
+ n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST,
+ bpl,
+ oldest_lsn + (lsn - lsn_old));
+ if (n_pages_flushed == ULINT_UNDEFINED) {
+ os_thread_sleep(5000);
+ goto retry_flush_batch;
+ }
+ }
+
+ lsn_old = lsn;
+ /*
+ fprintf(stderr,
+ "InnoDB flush: age pct: %lu, lsn progress: %lu, blocks to flush:%llu\n",
+ (lsn - oldest_lsn) * 100 / log_sys->max_checkpoint_age,
+ lsn - lsn_old, bpl);
+ */
+ } else {
+ lsn_old = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
+ }
+ }
+
+ } else {
+ mutex_enter(&(log_sys->mutex));
+ lsn_old = log_sys->lsn;
+ mutex_exit(&(log_sys->mutex));
}
if (srv_activity_count == old_activity_count) {
@@ -2537,7 +2787,8 @@ loop:
buf_flush_batch(BUF_FLUSH_LIST, PCT_IO(100), IB_ULONGLONG_MAX);
srv_main_thread_op_info = "flushing log";
- log_buffer_flush_to_disk();
+ /* No fsync when srv_flush_log_at_trx_commit != 1 */
+ log_buffer_flush_maybe_sync();
}
/* We run a batch of insert buffer merge every 10 seconds,
@@ -2547,7 +2798,8 @@ loop:
ibuf_contract_for_n_pages(TRUE, PCT_IBUF_IO((srv_insert_buffer_batch_size / 4)));
srv_main_thread_op_info = "flushing log";
- log_buffer_flush_to_disk();
+ /* No fsync when srv_flush_log_at_trx_commit != 1 */
+ log_buffer_flush_maybe_sync();
/* We run a full purge every 10 seconds, even if the server
were active */
@@ -2718,7 +2970,14 @@ flush_loop:
srv_main_thread_op_info = "flushing log";
- log_buffer_flush_to_disk();
+ current_time = time(NULL);
+ if (difftime(current_time, last_flush_time) > 1) {
+ log_buffer_flush_to_disk();
+ last_flush_time = current_time;
+ } else {
+ /* No fsync when srv_flush_log_at_trx_commit != 1 */
+ log_buffer_flush_maybe_sync();
+ }
srv_main_thread_op_info = "making checkpoint";
=== modified file 'storage/xtradb/srv/srv0start.c'
--- a/storage/xtradb/srv/srv0start.c 2009-06-09 15:08:46 +0000
+++ b/storage/xtradb/srv/srv0start.c 2009-08-03 20:09:53 +0000
@@ -1269,7 +1269,7 @@ innobase_start_or_create_for_mysql(void)
os_aio_init(8 * SRV_N_PENDING_IOS_PER_THREAD
* srv_n_file_io_threads,
srv_n_read_io_threads, srv_n_write_io_threads,
- SRV_MAX_N_PENDING_SYNC_IOS * 8);
+ SRV_MAX_N_PENDING_SYNC_IOS);
} else {
os_aio_init(SRV_N_PENDING_IOS_PER_THREAD
* srv_n_file_io_threads,
=== modified file 'storage/xtradb/sync/sync0sync.c'
--- a/storage/xtradb/sync/sync0sync.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/sync/sync0sync.c 2009-06-25 01:43:25 +0000
@@ -1081,6 +1081,12 @@ sync_thread_add_level(
case SYNC_TRX_SYS_HEADER:
case SYNC_FILE_FORMAT_TAG:
case SYNC_DOUBLEWRITE:
+ case SYNC_BUF_LRU_LIST:
+ case SYNC_BUF_FLUSH_LIST:
+ case SYNC_BUF_PAGE_HASH:
+ case SYNC_BUF_FREE_LIST:
+ case SYNC_BUF_ZIP_FREE:
+ case SYNC_BUF_ZIP_HASH:
case SYNC_BUF_POOL:
case SYNC_SEARCH_SYS:
case SYNC_SEARCH_SYS_CONF:
@@ -1107,7 +1113,7 @@ sync_thread_add_level(
/* Either the thread must own the buffer pool mutex
(buf_pool_mutex), or it is allowed to latch only ONE
buffer block (block->mutex or buf_pool_zip_mutex). */
- ut_a((sync_thread_levels_contain(array, SYNC_BUF_POOL)
+ ut_a((sync_thread_levels_contain(array, SYNC_BUF_LRU_LIST)
&& sync_thread_levels_g(array, SYNC_BUF_BLOCK - 1))
|| sync_thread_levels_g(array, SYNC_BUF_BLOCK));
break;
=== modified file 'storage/xtradb/ut/ut0ut.c'
--- a/storage/xtradb/ut/ut0ut.c 2009-03-26 06:11:11 +0000
+++ b/storage/xtradb/ut/ut0ut.c 2009-06-25 01:43:25 +0000
@@ -372,6 +372,8 @@ ut_get_year_month_day(
/*****************************************************************
Runs an idle loop on CPU. The argument gives the desired delay
in microseconds on 100 MHz Pentium + Visual C++. */
+extern ulint srv_spins_microsec;
+
UNIV_INTERN
ulint
ut_delay(
@@ -383,7 +385,11 @@ ut_delay(
j = 0;
- for (i = 0; i < delay * 50; i++) {
+ for (i = 0; i < delay * srv_spins_microsec; i++) {
+#if (defined (__i386__) || defined (__x86_64__)) && defined (__GNUC__)
+ /* it is equal to the instruction 'pause' */
+ __asm__ __volatile__ ("rep; nop");
+#endif
j += i;
}

[Maria-developers] Updated (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 29 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-5.1
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Guest - Wed, 29 Jul 2009, 21:41)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.26011 2009-07-29 21:41:04.000000000 +0300
+++ /tmp/wklog.17.new.26011 2009-07-29 21:41:04.000000000 +0300
@@ -2,163 +2,146 @@
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
-1. Conditions for removal
-1.1 Quick check if there are candidates
-2. Removal operation properties
-3. Removal operation
-4. User interface
-5. Tests and benchmarks
-6. Todo, issues to resolve
-6.1 To resolve
-6.2 Resolved
-7. Additional issues
+1. Elimination criteria
+2. No outside references check
+2.1 Quick check if there are tables with no outside references
+3. One-match check
+3.1 Functional dependency source #1: Potential eq_ref access
+3.2 Functional dependency source #2: col2=func(col1)
+3.3 Functional dependency source #3: One or zero records in the table
+3.4 Functional dependency check implementation
+3.4.1 Equality collection: Option1
+3.4.2 Equality collection: Option2
+3.4.3 Functional dependency propagation - option 1
+3.4.4 Functional dependency propagation - option 2
+4. Removal operation properties
+5. Removal operation
+6. User interface
+6.1 @@optimizer_switch flag
+6.2 EXPLAIN [EXTENDED]
+7. Miscellaneous adjustments
+7.1 Fix used_tables() of aggregate functions
+7.2 Make subquery predicates collect their outer references
+8. Other concerns
+8.1 Relationship with outer->inner joins converter
+8.2 Relationship with prepared statements
+8.3 Relationship with constant table detection
+9. Tests and benchmarks
</contents>
It's not really about elimination of tables; it's about elimination of the inner
sides of outer joins.
-1. Conditions for removal
--------------------------
-We can eliminate an inner side of outer join if:
-1. For each record combination of outer tables, it will always produce
- exactly one record.
-2. There are no references to columns of the inner tables anywhere else in
+1. Elimination criteria
+=======================
+We can eliminate inner side of an outer join nest if:
+
+1. There are no references to columns of the inner tables anywhere else in
the query.
+2. For each record combination of outer tables, it will always produce
+ exactly one matching record combination.
+
+Most of the effort in this WL entry goes into checking these two conditions.
-#1 means that every table inside the outer join nest is:
- - is a constant table:
- = because it can be accessed via eq_ref(const) access, or
- = it is a zero-rows or one-row MyISAM-like table [MARK1]
- - has an eq_ref access method candidate.
-
-#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
- GROUP BY and HAVING do not refer to the inner tables of the outer join
- nest.
-
-1.1 Quick check if there are candidates
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Before we start to enumerate join nests, here is a quick way to check if
-there *can be* something to be removed:
+2. No outside references check
+==============================
+Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
+outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
+tables of the outer join nest we're trying to remove.
+
+For multi-table UPDATE/DELETE we also must not remove tables that we're
+updating/deleting from or tables that are used in UPDATE's SET clause.
+
+2.1 Quick check if there are tables with no outside references
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Before we start searching for outer join nests that could be eliminated,
+we'll do a quick and cheap check if there possibly could be something that
+could be eliminated:
- if ((tables used in select_list |
+ if (there are outer joins &&
+ (tables used in select_list |
tables used in group/order by UNION |
- tables used in where) != bitmap_of_all_tables)
+ tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
-2. Removal operation properties
--------------------------------
-* There is always one way to remove (no choice to remove either this or that)
-* It is always better to remove as much tables as possible (at least within
- our cost model).
-Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
-3. Removal operation
---------------------
-* Remove the outer join nest's nested join structure (i.e. get the
- outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
- $OJ->embedding->nested_join. Update table_map's of all ancestor nested
- joins). [MARK2]
+3. One-match check
+==================
+We can eliminate inner side of outer join if it will always generate exactly
+one matching record combination.
-* Move the tables and their JOIN_TABs to front like it is done with const
- tables, with exception that if eliminated outer join nest was within
- another outer join nest, that shouldn't prevent us from moving away the
- eliminated tables.
+By definition of OUTER JOIN, a NULL-complemented record combination will be
+generated when the inner side of outer join has not produced any matches.
-* Update join->table_count and all-join-tables bitmap.
+What remains to be checked is that there is no possibility that the inner side of
+the outer join could produce more than one matching record combination.
-* That's it. Nothing else?
+We'll refer to one-match property as "functional dependency":
-4. User interface
------------------
-* We'll add an @@optimizer switch flag for table elimination. Tentative
- name: 'table_elimination'.
- (Note ^^ utility of the above questioned ^, as table elimination can never
- be worse than no elimination. We're leaning towards not adding the flag)
-
-* EXPLAIN will not show the removed tables at all. This will allow to check
- if tables were removed, and also will behave nicely with anchor model and
- VIEWs: stuff that user doesn't care about just won't be there.
+- An outer join nest is functionally dependent [wrt outer tables] if it will
+  produce one matching record combination for each record combination of
+ outer tables
-5. Tests and benchmarks
------------------------
-Create a benchmark in sql-bench which checks if the DBMS has table
-elimination.
-[According to Monty] Run
- - queries that would use elimination
- - queries that are very similar to one above (so that they would have same
- QEP, execution cost, etc) but cannot use table elimination.
-then compare run times and make a conclusion about whether dbms supports table
-elimination.
+- A table is functionally dependent wrt a certain set of dependency tables, if
+  a record combination of the dependency tables uniquely identifies zero or one
+ matching record in the table
-6. Todo, issues to resolve
---------------------------
+- Definitions of functional dependency of keys (=column tuples) and columns are
+ apparent.
-6.1 To resolve
-~~~~~~~~~~~~~~
-- Relationship with prepared statements.
- On one hand, it's natural to desire to make table elimination a
- once-per-statement operation, like outer->inner join conversion. We'll have
- to limit the applicability by removing [MARK1] as that can change during
- lifetime of the statement.
-
- The other option is to do table elimination every time. This will require to
- rework operation [MARK2] to be undoable.
-
- I'm leaning towards doing the former. With anchor modeling, it is unlikely
- that we'll meet outer joins which have N inner tables of which some are 1-row
- MyISAM tables that do not have primary key.
-
-6.2 Resolved
-~~~~~~~~~~~~
-* outer->inner join conversion is not a problem for table elimination.
- We make outer->inner conversions based on predicates in WHERE. If the WHERE
- referred to an inner table (requirement for OJ->IJ conversion) then table
- elimination would not be applicable anyway.
-
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
+Our goal is to prove that the entire join nest is functionally-dependent.
-* Aggregate functions used to report that they depend on all tables, that is,
+A join nest is functionally dependent (on the outside tables) if each of its
+elements (those can be either base tables or join nests) is functionally
+dependent.
- item_agg_func->used_tables() == (1ULL << join->tables) - 1
+Functional dependency is transitive: if table A is f-dependent on the outer
+tables and table B is f-dependent on {A, outer_tables}, then B is functionally
+dependent on the outer tables.
+
+Subsequent sections list cases when we can declare a table to be
+functionally-dependent.
+
+3.1 Functional dependency source #1: Potential eq_ref access
+------------------------------------------------------------
+This is the most practically-important case. Taking the example from the HLD
+of this WL entry:
+
+ select
+ A.colA
+ from
+ tableA A
+ left outer join
+ tableB B
+ on
+ B.id = A.id;
- always. Fixed it, now aggregate function reports it depends on
- tables that its arguments depend on. In particular, COUNT(*) reports
- that it depends on no tables (item_count_star->used_tables()==0).
- One consequence of that is that "item->used_tables()==0" is not
- equivalent to "item->const_item()==true" anymore (not sure if it's
- "anymore" or this has been already happening).
-
-* EXPLAIN EXTENDED warning text was generated after the JOIN object has
- been discarded. This didn't allow to use information about join plan
- when printing the warning. Fixed this by keeping the JOIN objects until
- we've printed the warning (have also an intent to remove the const
- tables from the join output).
-
-7. Additional issues
---------------------
-* We remove ON clauses within outer join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
- Yes. Current approach: when removing an outer join nest, walk the ON clause
- and mark subselects as eliminated. Then let EXPLAIN code check if the
- SELECT was eliminated before the printing (EXPLAIN is generated by doing
- a recursive descent, so the check will also cause children of eliminated
- selects not to be printed)
-
-* Table elimination is performed after constant table detection (but before
- the range analysis). Constant tables are technically different from
- eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
- Considering we've already done the join_read_const_table() call, is there any
- real difference between constant table and eliminated one? If there is, should
- we mark const tables also as eliminated?
- from user/EXPLAIN point of view: no. constant table is the one that we read
- one record from. eliminated table is the one that we don't acccess at all.
+and generalizing it: a table TBL is functionally-dependent if the ON
+expression allows us to construct a potential eq_ref access to table TBL that
+uses only outer or functionally-dependent tables.
+
+In other words: table TBL will have one match if the ON expression can be
+converted into this form
+
+ TBL.unique_key=func(one_match_tables) AND .. remainder ...
+
+(with appropriate extension for multi-part keys), where
+
+ one_match_tables= {
+ tables that are not on the inner side of the outer join in question, and
+ functionally dependent tables
+ }
+
+Note that this will cover constant tables, except those that are constant because
+they have 0/1 record or are partitioned and have no used partitions.
+
+
+3.2 Functional dependency source #2: col2=func(col1)
+----------------------------------------------------
+This comes from the second example in the HLD:
-* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
@@ -169,32 +152,331 @@
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
- This is because condition "B.fromDate= func(tableB)" cannot be used.
- Reason#1: update_ref_and_keys() does not consider such conditions to
- be of any use (and indeed they are not usable for ref access)
- so they are not put into KEYUSE array.
- Reason#2: even if they were put there, we would need to be able to tell
- between predicates like
- B.fromDate= func(B.id) // guarantees only one matching row as
- // B.id is already bound by B.id=A.id
- // hence B.fromDate becomes bound too.
- and
- "B.fromDate= func(B.*)" // Can potentially have many matching
- // records.
- We need to
- - Have update_ref_and_keys() create KEYUSE elements for such equalities
- - Have eliminate_tables() and friends make a more accurate check.
- The right check is to check whether all parts of a unique key are bound.
- If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
- keypartY to be bound.
- The difficulty here is that correlated subquery predicate cannot tell what
- columns it depends on (it only remembers tables).
- Traversing the predicate is expensive and complicated.
- We're leaning towards making each subquery predicate have a List<Item> with
- items that
- - are in the current select
- - and it depends on.
- This list will be useful in certain other subquery optimizations as well,
- it is cheap to collect it in fix_fields() phase, so it will be collected
- for every subquery predicate.
+Here it is apparent that tableB can be eliminated. It is not possible to
+construct eq_ref access to tableB, though, because for the second part of the
+primary key (fromDate column) we only got a condition in this form:
+
+ B.fromDate= func(tableB)
+
+(we write "func(tableB)" because the ref optimizer can only determine which tables
+the right part of the equality depends on).
+
+In the general case, an equality like this doesn't guarantee functional dependency.
+For example, if func() == { return fromDate;}, i.e the ON expression is
+
+ ... ON B.id = A.id and B.fromDate = B.fromDate
+
+then that would allow table B to have multiple matches per record of table A.
+
+In order to be able to distinguish between these two cases, we'll need to go
+down to column level:
+
+- A table is functionally dependent if it has a unique key that's functionally
+ dependent
+
+- A unique key is functionally dependent when all of its columns are
+ functionally dependent
+
+- A table column is functionally dependent if the ON clause allows us to extract
+ an AND-part in this form:
+
+ tbl.column = f(functionally-dependent columns or columns of outer tables)
+
+3.3 Functional dependency source #3: One or zero records in the table
+---------------------------------------------------------------------
+A table with one or zero records cannot generate more than one matching
+record. This source is of lesser importance as one/zero-record tables are only
+MyISAM tables.
+
+3.4 Functional dependency check implementation
+----------------------------------------------
+As shown above, we need something similar to KEYUSE structures, but not
+exactly that (we need things that current ref optimizer considers unusable and
+don't need things that it considers usable).
+
+3.4.1 Equality collection: Option1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+We could
+- extend KEYUSE structures to store all kinds of equalities we need
+- change update_ref_and_keys() and co. to collect equalities both for ref
+ access and for table elimination
+ = [possibly] Improve [eq_]ref access to be able to use equalities in
+ form keypart2=func(keypart1)
+- process the KEYUSE array both by table elimination and by ref access
+ optimizer.
+
++ This requires less effort.
+- Code will have to be changed all over sql_select.cc
+- update_ref_and_keys() and co. already do several unrelated things. Hooking
+ up table elimination will make it even worse.
+
+3.4.2 Equality collection: Option2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Alternatively, we could process the WHERE clause totally on our own.
++ Table elimination is a standalone, easy-to-detach module.
+- Some code duplication with update_ref_and_keys() and co.
+
+Having got the equalities, we'll need to propagate the functional dependency property
+to unique keys, tables and, ultimately, join nests.
+
+3.4.3 Functional dependency propagation - option 1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Borrow the approach used in constant table detection code:
+
+ do
+ {
+ converted= FALSE;
+ for each table T in join nest
+ {
+ if (check_if_functionally_dependent(T))
+ converted= TRUE;
+ }
+ } while (converted == TRUE);
+
+ check_if_functionally_dependent(T)
+ {
+ if (T has eq_ref access based on func_dep_tables)
+ return TRUE;
+
+ Apply the same do-while loop-based approach to available equalities
+ T.column1=func(other columns)
+ to spread the set of functionally-dependent columns. The goal is to get
+ all columns of a certain unique key to be bound.
+ }
+
+
+3.4.4 Functional dependency propagation - option 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Analyze the ON expression(s) and build a list of
+
+ tbl.field = expr(...)
+
+equalities. tbl here is a table that belongs to a join nest that could
+potentially be eliminated.
+
+Besides those, add to the list:
+ - An element for each unique key in the table that needs to be eliminated
+ - An element for each table that needs to be eliminated
+ - An element for each join nest that can be eliminated (i.e. has no
+ references from outside).
+
+Then, setup "reverse dependencies": each element should have pointers to
+elements that are functionally dependent on it:
+
+- "tbl.field=expr(...)" equality is functionally dependent on all fields that
+ are used in "expr(...)" (here we take into account only fields that belong
+ to tables that can potentially be eliminated).
+- a unique key is dependent on all of its components
+- a table is dependent on all of its unique keys
+- a join nest is dependent on all tables that it contains
+
+These pointers are stored in form of one bitmap, such that:
+
+ "X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
+
+Each object also stores a number of dependencies it needs to be satisfied
+before it itself is satisfied:
+
+- "tbl.field=expr(...)" needs all its underlying fields (if a field is
+ referenced many times it is counted only once)
+
+- a unique key needs all of its key parts
+
+- a table needs only one of its unique keys
+
+- a join nest needs all of its tables
+
+(TODO: so what do we do when we've marked a table as constant? We'll need to
+update the "field=expr(....)" elements that use fields of that table. And the
+problem is that we won't know how much to decrement from the counters of those
+elements.
+
+Solution#1: switch to table_map() based approach.
+Solution#2: introduce separate elements for each involved field.
+ field will depend on its table,
+ "field=expr" will depend on fields.
+)
+
+Besides the above, let each element have a pointer to another element, so that
+we can have a linked list of elements.
+
+After the above structures have been created, we start the main algorithm.
+
+The first step is to create a list of functionally-dependent elements. We walk
+across array of dependencies and mark those elements that are already bound
+(i.e. their dependencies are satisfied). At the moment those immediately-bound
+are only "field=expr" dependencies that don't refer to any columns that are
+not bound.
+
+The second step is the loop
+
+ while (bound_list is not empty)
+ {
+ Take the first bound element F off the list.
+ Use the bitmap to find out what other elements depended on it
+ for each such element E
+ {
+ if (E becomes bound after F is bound)
+ add E to the list;
+ }
+ }
+
+The last step is to walk through elements that represent the join nests. Those
+that are bound can be eliminated.
+
+4. Removal operation properties
+===============================
+* There is always one way to remove (no choice to remove either this or that)
+* It is always better to remove as much tables as possible (at least within
+ our cost model).
+Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
+
+
+5. Removal operation
+====================
+(This depends a lot on whether we make table elimination a one-off rewrite or
+conditional)
+
+At the moment table elimination is re-done for each join re-execution, hence
+the removal operation is designed not to modify any statement's permanent
+members.
+
+* Remove the outer join nest's nested join structure (i.e. get the
+ outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
+ $OJ->embedding->nested_join. Update table_map's of all ancestor nested
+ joins). [MARK2]
+
+* Move the tables and their JOIN_TABs to the front of join order, like it is
+ done with const tables, with exception that if eliminated outer join nest
+ was within another outer join nest, that shouldn't prevent us from moving
+ away the eliminated tables.
+
+* Update join->table_count and all-join-tables bitmap.
+ ^ TODO: not true anymore ^
+
+* That's it. Nothing else?
+
+6. User interface
+=================
+
+6.1 @@optimizer_switch flag
+---------------------------
+Argument against adding the flag:
+* It is always better to perform table elimination than not to do it.
+
+Arguments for the flag:
+* It is always theoretically possible that the new code will cause unintended
+ slowdowns.
+* Having the flag is useful for QA and comparative benchmarking.
+
+Decision so far: add the flag under #ifdef. Make the flag present in debug
+builds.
+
+6.2 EXPLAIN [EXTENDED]
+----------------------
+There are two possible options:
+1. Show eliminated tables, like we do with const tables.
+2. Do not show eliminated tables.
+
+We chose option 2, because:
+- the table is not accessed at all (besides locking it)
+- it is more natural for an anchor model user: when querying an anchor-and-
+  attributes view, he doesn't care about the unused attributes.
+
+EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
+
+NOTE: Before this WL, the warning text was generated after all JOIN objects
+had been destroyed. This made it impossible to use information about the
+join plan when printing the warning. We've fixed this by keeping the JOIN
+objects until the warning text has been generated.
+
+Table elimination removes the inner side of an outer join, and logically the
+ON clause is also removed. If this clause contains any subqueries, they are
+also removed from the EXPLAIN output.
+
+An exception to the above is that if we eliminate a derived table, it will
+still be shown in EXPLAIN output. This comes from the fact that the FROM
+subqueries are evaluated before table elimination is invoked.
+TODO: Is the above ok, or should we still remove parts of FROM subqueries?
+
+7. Miscellaneous adjustments
+============================
+
+7.1 Fix used_tables() of aggregate functions
+--------------------------------------------
+Aggregate functions used to report that they depend on all tables, that is,
+
+ item_agg_func->used_tables() == (1ULL << join->tables) - 1
+
+always. Fixed it: now an aggregate function reports that it depends on the
+tables that its arguments depend on. In particular, COUNT(*) reports that it
+depends on no tables (item_count_star->used_tables()==0). One consequence of
+this is that "item->used_tables()==0" is no longer equivalent to
+"item->const_item()==true" (not sure if it's "no longer" or this has already
+been so for some items).
+
+7.2 Make subquery predicates collect their outer references
+-----------------------------------------------------------
+Per-column functional dependency analysis requires us to take a
+
+ tbl.field = func(...)
+
+equality and tell which columns of which tables are referred to from the
+func(...) expression. For scalar expressions, this is accomplished by
+Item::walk()-based traversal. It should be reasonably cheap (the only
+practical Item that can be expensive to traverse seems to be the special
+case of "col IN (const1,const2,...)"; TODO: check whether we traverse the
+long list for such items).
+
+For correlated subqueries, traversal can be expensive; it is cheaper to make
+each subquery item have a list of its outer references. The list can be
+collected at fix_fields() stage with very little extra cost, and then it could
+be used for other optimizations.
+
+
+8. Other concerns
+=================
+
+8.1 Relationship with outer->inner joins converter
+--------------------------------------------------
+One could suspect that outer->inner join conversion could get in the way
+of table elimination by changing outer joins (which could be eliminated)
+to inner (which we will not try to eliminate).
+This concern is not valid: we make outer->inner conversions based on
+predicates in WHERE. If the WHERE referred to an inner table (this is a
+requirement for the conversion) then table elimination would not be
+applicable anyway.
+
+8.2 Relationship with prepared statements
+-----------------------------------------
+On one hand, it is natural to want to make table elimination a
+once-per-statement operation, like outer->inner join conversion. We'll have
+to limit the applicability by removing [MARK1], as that can change during
+the lifetime of the statement.
+
+The other option is to do table elimination every time. This will require
+reworking operation [MARK2] to be undoable.
+
+
+8.3 Relationship with constant table detection
+----------------------------------------------
+Table elimination is performed after constant table detection (but before
+the range analysis). Constant tables are technically different from
+eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
+Considering we've already done the join_read_const_table() call, is there any
+real difference between constant table and eliminated one? If there is, should
+we mark const tables also as eliminated?
+From the user/EXPLAIN point of view: no. A constant table is one that we
+read one record from; an eliminated table is one that we don't access at all.
+TODO
+
+9. Tests and benchmarks
+=======================
+Create a benchmark in sql-bench which checks if the DBMS has table
+elimination.
+[According to Monty] Run
+ - query Q1 that would use elimination
+ - query Q2 that is very similar to Q1 (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and draw a conclusion about whether the DBMS under
+test supports table elimination.
-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17
-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090 2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090 2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
------------------------------------------------------------
-=-=(View All Progress Notes, 26 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE tableB (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does is remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value; the unmatched B.* columns would all be NULL.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id - for example, and often the
case, a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words, the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is in the
lp:~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Elimination criteria
2. No outside references check
2.1 Quick check if there are tables with no outside references
3. One-match check
3.1 Functional dependency source #1: Potential eq_ref access
3.2 Functional dependency source #2: col2=func(col1)
3.3 Functional dependency source #3: One or zero records in the table
3.4 Functional dependency check implementation
3.4.1 Equality collection: Option1
3.4.2 Equality collection: Option2
3.4.3 Functional dependency propagation - option 1
3.4.4 Functional dependency propagation - option 2
4. Removal operation properties
5. Removal operation
6. User interface
6.1 @@optimizer_switch flag
6.2 EXPLAIN [EXTENDED]
7. Miscellaneous adjustments
7.1 Fix used_tables() of aggregate functions
7.2 Make subquery predicates collect their outer references
8. Other concerns
8.1 Relationship with outer->inner joins converter
8.2 Relationship with prepared statements
8.3 Relationship with constant table detection
9. Tests and benchmarks
</contents>
It's not really about elimination of tables; it's about elimination of the
inner sides of outer joins.
1. Elimination criteria
=======================
We can eliminate the inner side of an outer join nest if:
1. There are no references to columns of the inner tables anywhere else in
the query.
2. For each record combination of outer tables, it will always produce
exactly one matching record combination.
Most of the effort in this WL entry is in checking these two conditions.
2. No outside references check
==============================
Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
tables of the outer join nest we're trying to remove.
For multi-table UPDATE/DELETE we also must not remove tables that we're
updating/deleting from or tables that are used in UPDATE's SET clause.
2.1 Quick check if there are tables with no outside references
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start searching for outer join nests that could be eliminated,
we'll do a quick and cheap check of whether there could possibly be
something to eliminate:
  if (there are outer joins &&
      (tables used in select_list |
       tables used in group/order by |
       tables used in where) != bitmap_of_all_join_tables)
  {
    attempt table elimination;
  }
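As a runnable illustration of this pre-check (hypothetical function name; the server does the same comparison on table_map bitmaps):

```python
def may_have_eliminable_tables(has_outer_joins, select_map, group_order_map,
                               where_map, all_tables_map):
    """Cheap pre-check: only attempt elimination when some join table is
    referenced neither in the select list, GROUP/ORDER BY, nor WHERE."""
    used = select_map | group_order_map | where_map
    return has_outer_joins and used != all_tables_map

# Tables as bits: A=0b001, B=0b010, C=0b100. Only A is referenced, so
# B and C are candidates for elimination.
print(may_have_eliminable_tables(True, 0b001, 0, 0b001, 0b111))  # True
```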
3. One-match check
==================
We can eliminate the inner side of an outer join if it will always generate
exactly one matching record combination.
By definition of OUTER JOIN, a NULL-complemented record combination will be
generated when the inner side of the outer join has not produced any matches.
What remains to be checked is that there is no possibility that the inner
side of the outer join could produce more than one matching record combination.
We'll refer to one-match property as "functional dependency":
- An outer join nest is functionally dependent [wrt outer tables] if it will
  produce one matching record combination per each record combination of the
  outer tables
- A table is functionally dependent wrt a certain set of dependency tables if
  a record combination of the dependency tables uniquely identifies zero or
  one matching record in the table
- Definitions of functional dependency of keys (=column tuples) and columns
  are apparent.
Our goal is to prove that the entire join nest is functionally dependent.
A join nest is functionally dependent (on the outside tables) if each of its
elements (which can be either base tables or join nests) is functionally
dependent.
Functional dependency is transitive: if table A is f-dependent on the outer
tables and table B is f-dependent on {A, outer_tables}, then B is
functionally dependent on the outer tables.
Subsequent sections list cases when we can declare a table to be
functionally-dependent.
3.1 Functional dependency source #1: Potential eq_ref access
------------------------------------------------------------
This is the most practically-important case. Taking the example from the HLD
of this WL entry:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
and generalizing it: a table TBL is functionally dependent if the ON
expression allows us to construct a potential eq_ref access to table TBL
that uses only outer or functionally-dependent tables.
In other words: table TBL will have one match if the ON expression can be
converted into this form:
  TBL.unique_key=func(one_match_tables) AND .. remainder ...
(with appropriate extension for multi-part keys), where
one_match_tables= {
tables that are not on the inner side of the outer join in question, and
functionally dependent tables
}
Note that this will cover constant tables, except those that are constant because
they have 0/1 record or are partitioned and have no used partitions.
3.2 Functional dependency source #2: col2=func(col1)
----------------------------------------------------
This comes from the second example in the HLS:
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
Here it is apparent that tableB can be eliminated. It is not possible to
construct eq_ref access to tableB, though, because for the second part of
the primary key (the fromDate column) we only have a condition of this form:
  B.fromDate= func(tableB)
(we write "func(tableB)" because the ref optimizer can only determine which
tables the right part of the equality depends on).
In the general case, an equality like this doesn't guarantee functional
dependency. For example, if func() == { return fromDate; }, i.e. the ON
expression is
  ... ON B.id = A.id and B.fromDate = B.fromDate
then that would allow table B to have multiple matches per record of table A.
In order to be able to distinguish between these two cases, we'll need to go
down to the column level:
- A table is functionally dependent if it has a unique key that is
  functionally dependent
- A unique key is functionally dependent when all of its columns are
  functionally dependent
- A table column is functionally dependent if the ON clause allows us to
  extract an AND-part of this form:
    tbl.column = f(functionally-dependent columns or columns of outer tables)
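To make the two cases concrete, here is a small sketch (illustrative names, not server code) that propagates the "bound" property through tbl.column = f(...) equalities and then checks whether all parts of some unique key are bound. It distinguishes B.fromDate=func(B.id) from B.fromDate=func(B.fromDate):

```python
def table_is_func_dependent(equalities, unique_keys, outer_cols):
    """equalities: list of (column, set of columns used by the RHS).
    A column becomes bound if its RHS uses only outer or bound columns;
    the table is functionally dependent if all columns of some unique
    key become bound."""
    bound = set(outer_cols)
    changed = True
    while changed:
        changed = False
        for col, rhs_cols in equalities:
            if col not in bound and rhs_cols <= bound:
                bound.add(col)
                changed = True
    return any(set(key) <= bound for key in unique_keys)

# ON B.id = A.id AND B.fromDate = func(B.id): key (id, fromDate) gets bound.
eqs_ok = [("B.id", {"A.id"}), ("B.fromDate", {"B.id"})]
# ON B.id = A.id AND B.fromDate = B.fromDate: fromDate never becomes bound.
eqs_bad = [("B.id", {"A.id"}), ("B.fromDate", {"B.fromDate"})]
key = [("B.id", "B.fromDate")]
print(table_is_func_dependent(eqs_ok, key, {"A.id"}))   # True
print(table_is_func_dependent(eqs_bad, key, {"A.id"}))  # False
```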
3.3 Functional dependency source #3: One or zero records in the table
---------------------------------------------------------------------
A table with one or zero records cannot generate more than one matching
record. This source is of lesser importance as one/zero-record tables are only
MyISAM tables.
3.4 Functional dependency check implementation
----------------------------------------------
As shown above, we need something similar to KEYUSE structures, but not
exactly that (we need things that the current ref optimizer considers
unusable, and don't need things that it considers usable).
3.4.1 Equality collection: Option1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could
- extend KEYUSE structures to store all kinds of equalities we need
- change update_ref_and_keys() and co. to collect equalities both for ref
access and for table elimination
= [possibly] Improve [eq_]ref access to be able to use equalities in
form keypart2=func(keypart1)
- process the KEYUSE array both by table elimination and by ref access
optimizer.
+ This requires less effort.
- Code will have to be changed all over sql_select.cc
- update_ref_and_keys() and co. already do several unrelated things. Hooking
up table elimination will make it even worse.
3.4.2 Equality collection: Option2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alternatively, we could process the WHERE clause totally on our own.
+ Table elimination is a standalone, easy-to-detach module.
- Some code duplication with update_ref_and_keys() and co.
Having got the equalities, we'll need to propagate the functional dependency
property to unique keys, tables and, ultimately, join nests.
3.4.3 Functional dependency propagation - option 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Borrow the approach used in constant table detection code:
  do
  {
    converted= FALSE;
    for each table T in join nest
    {
      if (check_if_functionally_dependent(T))
        converted= TRUE;
    }
  } while (converted == TRUE);

  check_if_functionally_dependent(T)
  {
    if (T has eq_ref access based on func_dep_tables)
      return TRUE;

    Apply the same do-while loop-based approach to available equalities
      T.column1=func(other columns)
    to spread the set of functionally-dependent columns. The goal is to get
    all columns of a certain unique key to be bound.
  }
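The table-level part of the loop above can be written as a runnable sketch (an illustration under assumed data structures, not the actual JOIN code): tables are declared functionally dependent one by one until a fixpoint is reached.

```python
def eliminable_tables(tables, outer, eq_ref_deps):
    """eq_ref_deps maps a table to the set of tables whose columns are used
    by a potential eq_ref access to it. A table becomes functionally
    dependent once all of those tables are outer or already dependent."""
    func_dep = set(outer)
    converted = True
    while converted:                      # the do-while loop from 3.4.3
        converted = False
        for t in tables:
            if t not in func_dep and t in eq_ref_deps \
               and eq_ref_deps[t] <= func_dep:
                func_dep.add(t)
                converted = True
    return func_dep - set(outer)

# A is an outer table; B has eq_ref access built from A's columns, and C
# from B's columns, so both become functionally dependent.
print(sorted(eliminable_tables(["B", "C"], {"A"}, {"B": {"A"}, "C": {"B"}})))
```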
3.4.4 Functional dependency propagation - option 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Analyze the ON expression(s) and build a list of
tbl.field = expr(...)
equalities. tbl here is a table that belongs to a join nest that could
potentially be eliminated.
Besides those, add to the list:
- An element for each unique key in the table that needs to be eliminated
- An element for each table that needs to be eliminated
- An element for each join nest that can be eliminated (i.e. has no
  references from outside).
Then, setup "reverse dependencies": each element should have pointers to
elements that are functionally dependent on it:
- "tbl.field=expr(...)" equality is functionally dependent on all fields that
are used in "expr(...)" (here we take into account only fields that belong
to tables that can potentially be eliminated).
- a unique key is dependent on all of its components
- a table is dependent on all of its unique keys
- a join nest is dependent on all tables that it contains
These pointers are stored in form of one bitmap, such that:
"X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
Each object also stores a number of dependencies it needs to be satisfied
before it itself is satisfied:
- "tbl.field=expr(...)" needs all its underlying fields (if a field is
referenced many times it is counted only once)
- a unique key needs all of its key parts
- a table needs only one of its unique keys
- a join nest needs all of its tables
(TODO: so what do we do when we've marked a table as constant? We'll need to
update the "field=expr(....)" elements that use fields of that table. And the
problem is that we won't know how much to decrement from the counters of those
elements.
Solution#1: switch to table_map() based approach.
Solution#2: introduce separate elements for each involved field.
field will depend on its table,
"field=expr" will depend on fields.
)
Besides the above, let each element have a pointer to another element, so that
we can have a linked list of elements.
After the above structures have been created, we start the main algorithm.
The first step is to create a list of functionally-dependent elements. We
walk across the array of dependencies and mark those elements that are
already bound (i.e. their dependencies are satisfied). At the moment the
only immediately-bound elements are "field=expr" dependencies that don't
refer to any columns that are not bound.
The second step is the loop
  while (bound_list is not empty)
  {
    Take the first bound element F off the list.
    Use the bitmap to find out what other elements depended on it.
    for each such element E
    {
      if (E becomes bound after F is bound)
        add E to the list;
    }
  }
The last step is to walk through elements that represent the join nests. Those
that are bound can be eliminated.
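A runnable sketch of this worklist algorithm (generic names, assumed structures): each element keeps a counter of unsatisfied prerequisites, and bound elements are taken off the list and decrement the counters of their dependents.

```python
from collections import deque

def propagate_bound(needs, dependents, initially_bound):
    """needs[e]: number of prerequisites element e still requires;
    dependents[e]: elements that have e as a prerequisite. Returns the
    set of elements that end up bound; join nests in this set can be
    eliminated."""
    bound = set(initially_bound)
    work = deque(initially_bound)
    while work:
        f = work.popleft()
        for e in dependents.get(f, ()):
            needs[e] -= 1
            if needs[e] == 0 and e not in bound:
                bound.add(e)
                work.append(e)
    return bound

# Hypothetical wiring for the WL's fromDate example: B.id=A.id binds column
# "id" immediately; "fromDate" needs "id" (via fromDate=func(id)); the
# unique key needs both columns; the table needs one of its keys; the join
# nest needs its one table.
needs = {"id": 0, "fromDate": 1, "key": 2, "tableB": 1, "nest": 1}
deps = {"id": ["fromDate", "key"], "fromDate": ["key"],
        "key": ["tableB"], "tableB": ["nest"]}
print("nest" in propagate_bound(needs, deps, ["id"]))  # True
```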
4. Removal operation properties
===============================
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
5. Removal operation
====================
(This depends a lot on whether we make table elimination a one-off rewrite or
conditional)
At the moment table elimination is re-done for each join re-execution, hence
the removal operation is designed not to modify any statement's permanent
members.
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front of the join order, as is
done with const tables, with the exception that if the eliminated outer join
nest was within another outer join nest, that shouldn't prevent us from
moving away the eliminated tables.
* Update join->table_count and all-join-tables bitmap.
^ TODO: not true anymore ^
* That's it. Nothing else?
6. User interface
=================
6.1 @@optimizer_switch flag
---------------------------
Argument against adding the flag:
* It is always better to perform table elimination than not to do it.
Arguments for the flag:
* It is always theoretically possible that the new code will cause unintended
slowdowns.
* Having the flag is useful for QA and comparative benchmarking.
Decision so far: add the flag under #ifdef and make it present in debug
builds.
6.2 EXPLAIN [EXTENDED]
----------------------
There are two possible options:
1. Show eliminated tables, like we do with const tables.
2. Do not show eliminated tables.
We chose option 2, because:
- the table is not accessed at all (besides locking it)
- it is more natural for an anchor model user - when querying an anchor-
and-attributes view, they don't care about the unused attributes.
EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
NOTE: Before this WL, the warning text was generated after all JOIN objects
had been destroyed. This made it impossible to use information about the
join plan when printing the warning. We've fixed this by keeping the JOIN
objects until the warning text has been generated.
Table elimination removes the inner side of an outer join, and logically the
ON clause is also removed. If this clause contains any subqueries, they are
also removed from the EXPLAIN output.
An exception to the above is that if we eliminate a derived table, it will
still be shown in EXPLAIN output. This comes from the fact that the FROM
subqueries are evaluated before table elimination is invoked.
TODO: Is the above ok, or should we still remove parts of FROM subqueries?
7. Miscellaneous adjustments
============================
7.1 Fix used_tables() of aggregate functions
--------------------------------------------
Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it: now an aggregate function reports that it depends on the
tables that its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0). One consequence of
this is that "item->used_tables()==0" is no longer equivalent to
"item->const_item()==true" (not sure if it's "no longer" or this has already
been so for some items).
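The before/after semantics can be illustrated with plain bitmaps (an illustration of the behaviour only; in the server these are table_map values):

```python
def old_agg_used_tables(n_join_tables):
    # Old behaviour: an aggregate claimed to depend on every join table.
    return (1 << n_join_tables) - 1

def new_agg_used_tables(arg_maps):
    # New behaviour: the union of the argument items' used_tables() maps.
    m = 0
    for a in arg_maps:
        m |= a
    return m

print(old_agg_used_tables(3))        # 7: COUNT(*) "used" all 3 tables
print(new_agg_used_tables([]))       # 0: COUNT(*) depends on no tables
print(new_agg_used_tables([0b010]))  # 2: SUM(t2.col) depends on t2 only
```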
7.2 Make subquery predicates collect their outer references
-----------------------------------------------------------
Per-column functional dependency analysis requires us to take a
tbl.field = func(...)
equality and tell which columns of which tables are referred from func(...)
expression. For scalar expressions, this is accomplished by Item::walk()-based
traversal. It should be reasonably cheap (the only practical Item that can be
expensive to traverse seems to be the special case of "col IN (const1,const2,
...)"; check whether we traverse the long list for such items).
For correlated subqueries, traversal can be expensive, so it is cheaper to
make each subquery item keep a list of its outer references. The list can be
collected at the fix_fields() stage with very little extra cost, and then it
could be used for other optimizations.
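The two strategies can be contrasted in a sketch (the data model below is hypothetical, standing in for Item trees and subquery predicates):

```python
# Minimal stand-ins for a scalar expression tree.
class Column:
    def __init__(self, table, name):
        self.table, self.name = table, name

class Func:
    def __init__(self, *args):
        self.args = args

def walk_columns(item):
    """Item::walk()-style recursive traversal: collect referenced columns."""
    if isinstance(item, Column):
        return {(item.table, item.name)}
    refs = set()
    for a in item.args:
        refs |= walk_columns(a)
    return refs

class Subquery:
    """A subquery predicate carrying its outer references, collected once
    at "fix_fields()" time, so no traversal is needed later."""
    def __init__(self, outer_refs):
        self.outer_refs = set(outer_refs)

expr = Func(Column('A', 'id'), Func(Column('A', 'date')))
assert walk_columns(expr) == {('A', 'id'), ('A', 'date')}

sub = Subquery([('A', 'id')])
assert sub.outer_refs == {('A', 'id')}   # ready-made list, no walk
```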
8. Other concerns
=================
8.1 Relationship with outer->inner joins converter
--------------------------------------------------
One could suspect that outer->inner join conversion could get in the way
of table elimination by changing outer joins (which could be eliminated)
to inner (which we will not try to eliminate).
This concern is not valid: we make outer->inner conversions based on
predicates in the WHERE clause. If the WHERE clause refers to an inner table
(this is a requirement for the conversion), then table elimination would not
be applicable anyway.
8.2 Relationship with prepared statements
-----------------------------------------
On one hand, it is natural to want to make table elimination a
once-per-statement operation, like outer->inner join conversion. We would have
to limit its applicability by removing [MARK1], as that can change during the
lifetime of the statement.
The other option is to do table elimination every time. This will require
reworking operation [MARK2] to be undoable.
8.3 Relationship with constant table detection
----------------------------------------------
Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between a constant table and an eliminated one? If there is,
should we also mark const tables as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we read
one record from; an eliminated table is one that we don't access at all.
TODO
9. Tests and benchmarks
=======================
Create a benchmark in sql-bench which checks whether the DBMS supports table
elimination.
[According to Monty] Run
- query Q1, which would use elimination
- query Q2, which is very similar to Q1 (so that they would have the same
  QEP, execution cost, etc.) but cannot use table elimination.
Then compare the run times and draw a conclusion about whether the DBMS under
test supports table elimination.
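The comparison step might be sketched like this (a sketch only: the timings below are synthetic, and `run times for Q1/Q2` would come from timing the real queries against the server under test):

```python
import statistics

def supports_elimination(times_q1, times_q2, threshold=0.8):
    """Q1 (eliminable join) should be clearly faster than the otherwise
    near-identical Q2 if the DBMS performs table elimination."""
    return statistics.median(times_q1) < threshold * statistics.median(times_q2)

# Synthetic per-run timings in seconds: Q1 skips the inner table, Q2 cannot.
q1_times = [0.011, 0.010, 0.012]
q2_times = [0.030, 0.029, 0.031]

assert supports_elimination(q1_times, q2_times)        # clear speedup
assert not supports_elimination(q2_times, q2_times)    # no difference
```

Using the median rather than the mean makes the verdict robust against a single noisy run.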
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 29 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-5.1
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Guest - Wed, 29 Jul 2009, 21:41)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.26011 2009-07-29 21:41:04.000000000 +0300
+++ /tmp/wklog.17.new.26011 2009-07-29 21:41:04.000000000 +0300
@@ -2,163 +2,146 @@
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
-1. Conditions for removal
-1.1 Quick check if there are candidates
-2. Removal operation properties
-3. Removal operation
-4. User interface
-5. Tests and benchmarks
-6. Todo, issues to resolve
-6.1 To resolve
-6.2 Resolved
-7. Additional issues
+1. Elimination criteria
+2. No outside references check
+2.1 Quick check if there are tables with no outside references
+3. One-match check
+3.1 Functional dependency source #1: Potential eq_ref access
+3.2 Functional dependency source #2: col2=func(col1)
+3.3 Functional dependency source #3: One or zero records in the table
+3.4 Functional dependency check implementation
+3.4.1 Equality collection: Option1
+3.4.2 Equality collection: Option2
+3.4.3 Functional dependency propagation - option 1
+3.4.4 Functional dependency propagation - option 2
+4. Removal operation properties
+5. Removal operation
+6. User interface
+6.1 @@optimizer_switch flag
+6.2 EXPLAIN [EXTENDED]
+7. Miscellaneous adjustments
+7.1 Fix used_tables() of aggregate functions
+7.2 Make subquery predicates collect their outer references
+8. Other concerns
+8.1 Relationship with outer->inner joins converter
+8.2 Relationship with prepared statements
+8.3 Relationship with constant table detection
+9. Tests and benchmarks
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
-1. Conditions for removal
--------------------------
-We can eliminate an inner side of outer join if:
-1. For each record combination of outer tables, it will always produce
- exactly one record.
-2. There are no references to columns of the inner tables anywhere else in
+1. Elimination criteria
+=======================
+We can eliminate inner side of an outer join nest if:
+
+1. There are no references to columns of the inner tables anywhere else in
the query.
+2. For each record combination of outer tables, it will always produce
+ exactly one matching record combination.
+
+Most of the effort in this WL entry goes into checking these two conditions.
-#1 means that every table inside the outer join nest is:
- - is a constant table:
- = because it can be accessed via eq_ref(const) access, or
- = it is a zero-rows or one-row MyISAM-like table [MARK1]
- - has an eq_ref access method candidate.
-
-#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
- GROUP BY and HAVING do not refer to the inner tables of the outer join
- nest.
-
-1.1 Quick check if there are candidates
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Before we start to enumerate join nests, here is a quick way to check if
-there *can be* something to be removed:
+2. No outside references check
+==============================
+Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
+outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
+tables of the outer join nest we're trying to remove.
+
+For multi-table UPDATE/DELETE we also must not remove tables that we're
+updating/deleting from or tables that are used in UPDATE's SET clause.
+
+2.1 Quick check if there are tables with no outside references
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Before we start searching for outer join nests that could be eliminated,
+we'll do a quick and cheap check if there possibly could be something that
+could be eliminated:
- if ((tables used in select_list |
+ if (there are outer joins &&
+ (tables used in select_list |
tables used in group/order by UNION |
- tables used in where) != bitmap_of_all_tables)
+ tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
-2. Removal operation properties
--------------------------------
-* There is always one way to remove (no choice to remove either this or that)
-* It is always better to remove as much tables as possible (at least within
- our cost model).
-Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
-3. Removal operation
---------------------
-* Remove the outer join nest's nested join structure (i.e. get the
- outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
- $OJ->embedding->nested_join. Update table_map's of all ancestor nested
- joins). [MARK2]
+3. One-match check
+==================
+We can eliminate inner side of outer join if it will always generate exactly
+one matching record combination.
-* Move the tables and their JOIN_TABs to front like it is done with const
- tables, with exception that if eliminated outer join nest was within
- another outer join nest, that shouldn't prevent us from moving away the
- eliminated tables.
+By definition of OUTER JOIN, a NULL-complemented record combination will be
+generated when the inner side of outer join has not produced any matches.
-* Update join->table_count and all-join-tables bitmap.
+What remains to be checked is that there is no possibility that the inner
+side of the outer join could produce more than one matching record combination.
-* That's it. Nothing else?
+We'll refer to the one-match property as "functional dependency":
-4. User interface
------------------
-* We'll add an @@optimizer switch flag for table elimination. Tentative
- name: 'table_elimination'.
- (Note ^^ utility of the above questioned ^, as table elimination can never
- be worse than no elimination. We're leaning towards not adding the flag)
-
-* EXPLAIN will not show the removed tables at all. This will allow to check
- if tables were removed, and also will behave nicely with anchor model and
- VIEWs: stuff that user doesn't care about just won't be there.
+- An outer join nest is functionally dependent [wrt outer tables] if it will
+ produce one matching record combination per each record combination of
+ outer tables
-5. Tests and benchmarks
------------------------
-Create a benchmark in sql-bench which checks if the DBMS has table
-elimination.
-[According to Monty] Run
- - queries that would use elimination
- - queries that are very similar to one above (so that they would have same
- QEP, execution cost, etc) but cannot use table elimination.
-then compare run times and make a conclusion about whether dbms supports table
-elimination.
+- A table is functionally dependent wrt certain set of dependency tables, if
+ record combination of dependency tables uniquely identifies zero or one
+ matching record in the table
-6. Todo, issues to resolve
---------------------------
+- Definitions of functional dependency of keys (=column tuples) and columns are
+ apparent.
-6.1 To resolve
-~~~~~~~~~~~~~~
-- Relationship with prepared statements.
- On one hand, it's natural to desire to make table elimination a
- once-per-statement operation, like outer->inner join conversion. We'll have
- to limit the applicability by removing [MARK1] as that can change during
- lifetime of the statement.
-
- The other option is to do table elimination every time. This will require to
- rework operation [MARK2] to be undoable.
-
- I'm leaning towards doing the former. With anchor modeling, it is unlikely
- that we'll meet outer joins which have N inner tables of which some are 1-row
- MyISAM tables that do not have primary key.
-
-6.2 Resolved
-~~~~~~~~~~~~
-* outer->inner join conversion is not a problem for table elimination.
- We make outer->inner conversions based on predicates in WHERE. If the WHERE
- referred to an inner table (requirement for OJ->IJ conversion) then table
- elimination would not be applicable anyway.
-
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
+Our goal is to prove that the entire join nest is functionally-dependent.
-* Aggregate functions used to report that they depend on all tables, that is,
+A join nest is functionally dependent (on the outside tables) if each of its
+elements (those can be either base tables or join nests) is functionally
+dependent.
- item_agg_func->used_tables() == (1ULL << join->tables) - 1
+Functional dependency is transitive: if table A is f-dependent on the outer
+tables and table B is f-dependent on {A, outer_tables}, then B is functionally
+dependent on the outer tables.
+
+Subsequent sections list cases when we can declare a table to be
+functionally-dependent.
+
+3.1 Functional dependency source #1: Potential eq_ref access
+------------------------------------------------------------
+This is the most practically-important case. Taking the example from the HLD
+of this WL entry:
+
+ select
+ A.colA
+ from
+ tableA A
+ left outer join
+ tableB B
+ on
+ B.id = A.id;
- always. Fixed it, now aggregate function reports it depends on
- tables that its arguments depend on. In particular, COUNT(*) reports
- that it depends on no tables (item_count_star->used_tables()==0).
- One consequence of that is that "item->used_tables()==0" is not
- equivalent to "item->const_item()==true" anymore (not sure if it's
- "anymore" or this has been already happening).
-
-* EXPLAIN EXTENDED warning text was generated after the JOIN object has
- been discarded. This didn't allow to use information about join plan
- when printing the warning. Fixed this by keeping the JOIN objects until
- we've printed the warning (have also an intent to remove the const
- tables from the join output).
-
-7. Additional issues
---------------------
-* We remove ON clauses within outer join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
- Yes. Current approach: when removing an outer join nest, walk the ON clause
- and mark subselects as eliminated. Then let EXPLAIN code check if the
- SELECT was eliminated before the printing (EXPLAIN is generated by doing
- a recursive descent, so the check will also cause children of eliminated
- selects not to be printed)
-
-* Table elimination is performed after constant table detection (but before
- the range analysis). Constant tables are technically different from
- eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
- Considering we've already done the join_read_const_table() call, is there any
- real difference between constant table and eliminated one? If there is, should
- we mark const tables also as eliminated?
- from user/EXPLAIN point of view: no. constant table is the one that we read
- one record from. eliminated table is the one that we don't acccess at all.
+and generalizing it: a table TBL is functionally dependent if the ON
+expression allows us to construct a potential eq_ref access to table TBL that
+uses only outer or functionally-dependent tables.
+
+In other words: table TBL will have one match if the ON expression can be
+converted into this form
+
+ TBL.unique_key=func(one_match_tables) AND .. remainder ...
+
+(with appropriate extension for multi-part keys), where
+
+ one_match_tables= {
+ tables that are not on the inner side of the outer join in question, and
+ functionally dependent tables
+ }
+
+Note that this will cover constant tables, except those that are constant because
+they have 0/1 record or are partitioned and have no used partitions.
+
+
+3.2 Functional dependency source #2: col2=func(col1)
+----------------------------------------------------
+This comes from the second example in the HLD:
-* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
@@ -169,32 +152,331 @@
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
- This is because condition "B.fromDate= func(tableB)" cannot be used.
- Reason#1: update_ref_and_keys() does not consider such conditions to
- be of any use (and indeed they are not usable for ref access)
- so they are not put into KEYUSE array.
- Reason#2: even if they were put there, we would need to be able to tell
- between predicates like
- B.fromDate= func(B.id) // guarantees only one matching row as
- // B.id is already bound by B.id=A.id
- // hence B.fromDate becomes bound too.
- and
- "B.fromDate= func(B.*)" // Can potentially have many matching
- // records.
- We need to
- - Have update_ref_and_keys() create KEYUSE elements for such equalities
- - Have eliminate_tables() and friends make a more accurate check.
- The right check is to check whether all parts of a unique key are bound.
- If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
- keypartY to be bound.
- The difficulty here is that correlated subquery predicate cannot tell what
- columns it depends on (it only remembers tables).
- Traversing the predicate is expensive and complicated.
- We're leaning towards making each subquery predicate have a List<Item> with
- items that
- - are in the current select
- - and it depends on.
- This list will be useful in certain other subquery optimizations as well,
- it is cheap to collect it in fix_fields() phase, so it will be collected
- for every subquery predicate.
+Here it is apparent that tableB can be eliminated. It is not possible to
+construct eq_ref access to tableB, though, because for the second part of the
+primary key (fromDate column) we only got a condition in this form:
+
+ B.fromDate= func(tableB)
+
+(we write "func(tableB)" because the ref optimizer can only determine which
+tables the right part of the equality depends on).
+
+In the general case, an equality like this doesn't guarantee functional dependency.
+For example, if func() == { return fromDate; }, i.e. the ON expression is
+
+ ... ON B.id = A.id and B.fromDate = B.fromDate
+
+then that would allow table B to have multiple matches per record of table A.
+
+In order to be able to distinguish between these two cases, we'll need to go
+down to column level:
+
+- A table is functionally dependent if it has a unique key that's functionally
+ dependent
+
+- A unique key is functionally dependent when all of its columns are
+ functionally dependent
+
+- A table column is functionally dependent if the ON clause allows extracting
+  an AND-part of this form:
+
+ tbl.column = f(functionally-dependent columns or columns of outer tables)
+
+3.3 Functional dependency source #3: One or zero records in the table
+---------------------------------------------------------------------
+A table with one or zero records cannot generate more than one matching
+record. This source is of lesser importance as one/zero-record tables are only
+MyISAM tables.
+
+3.4 Functional dependency check implementation
+----------------------------------------------
+As shown above, we need something similar to KEYUSE structures, but not
+exactly that (we need things that current ref optimizer considers unusable and
+don't need things that it considers usable).
+
+3.4.1 Equality collection: Option1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+We could
+- extend KEYUSE structures to store all kinds of equalities we need
+- change update_ref_and_keys() and co. to collect equalities both for ref
+ access and for table elimination
+ = [possibly] Improve [eq_]ref access to be able to use equalities in
+ form keypart2=func(keypart1)
+- process the KEYUSE array both by table elimination and by ref access
+ optimizer.
+
++ This requires less effort.
+- Code will have to be changed all over sql_select.cc
+- update_ref_and_keys() and co. already do several unrelated things. Hooking
+ up table elimination will make it even worse.
+
+3.4.2 Equality collection: Option2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Alternatively, we could process the WHERE clause totally on our own.
++ Table elimination is a standalone and easily detachable module.
+- Some code duplication with update_ref_and_keys() and co.
+
+Having got the equalities, we'll need to propagate the functional dependency
+property to unique keys, tables and, ultimately, join nests.
+
+3.4.3 Functional dependency propagation - option 1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Borrow the approach used in constant table detection code:
+
+ do
+ {
+ converted= FALSE;
+ for each table T in join nest
+ {
+ if (check_if_functionally_dependent(T))
+ converted= TRUE;
+ }
+ } while (converted == TRUE);
+
+ check_if_functionally_dependent(T)
+ {
+ if (T has eq_ref access based on func_dep_tables)
+ return TRUE;
+
+ Apply the same do-while loop-based approach to available equalities
+ T.column1=func(other columns)
+ to spread the set of functionally-dependent columns. The goal is to get
+ all columns of a certain unique key to be bound.
+ }
+
+
+3.4.4 Functional dependency propagation - option 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Analyze the ON expression(s) and build a list of
+
+ tbl.field = expr(...)
+
+equalities. tbl here is a table that belongs to a join nest that could
+potentially be eliminated.
+
+besides those, add to the list
+ - An element for each unique key in the table that needs to be eliminated
+ - An element for each table that needs to be eliminated
+ - An element for each join nest that can be eliminated (i.e. has no
+ references from outside).
+
+Then, setup "reverse dependencies": each element should have pointers to
+elements that are functionally dependent on it:
+
+- "tbl.field=expr(...)" equality is functionally dependent on all fields that
+ are used in "expr(...)" (here we take into account only fields that belong
+ to tables that can potentially be eliminated).
+- a unique key is dependent on all of its components
+- a table is dependent on all of its unique keys
+- a join nest is dependent on all tables that it contains
+
+These pointers are stored in form of one bitmap, such that:
+
+ "X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
+
+Each object also stores a number of dependencies it needs to be satisfied
+before it itself is satisfied:
+
+- "tbl.field=expr(...)" needs all its underlying fields (if a field is
+ referenced many times it is counted only once)
+
+- a unique key needs all of its key parts
+
+- a table needs only one of its unique keys
+
+- a join nest needs all of its tables
+
+(TODO: so what do we do when we've marked a table as constant? We'll need to
+update the "field=expr(....)" elements that use fields of that table. And the
+problem is that we won't know how much to decrement from the counters of those
+elements.
+
+Solution#1: switch to table_map() based approach.
+Solution#2: introduce separate elements for each involved field.
+ field will depend on its table,
+ "field=expr" will depend on fields.
+)
+
+Besides the above, let each element have a pointer to another element, so that
+we can have a linked list of elements.
+
+After the above structures have been created, we start the main algorithm.
+
+The first step is to create a list of functionally-dependent elements. We walk
+across array of dependencies and mark those elements that are already bound
+(i.e. their dependencies are satisfied). At the moment those immediately-bound
+are only "field=expr" dependencies that don't refer to any columns that are
+not bound.
+
+The second step is the loop
+
+ while (bound_list is not empty)
+ {
+ Take the first bound element F off the list.
+ Use the bitmap to find out what other elements depended on it
+ for each such element E
+ {
+ if (E becomes bound after F is bound)
+ add E to the list;
+ }
+ }
+
+The last step is to walk through elements that represent the join nests. Those
+that are bound can be eliminated.
+
+4. Removal operation properties
+===============================
+* There is always one way to remove (no choice to remove either this or that)
+* It is always better to remove as much tables as possible (at least within
+ our cost model).
+Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
+
+
+5. Removal operation
+====================
+(This depends a lot on whether we make table elimination a one-off rewrite or
+conditional)
+
+At the moment table elimination is re-done for each join re-execution, hence
+the removal operation is designed not to modify any statement's permanent
+members.
+
+* Remove the outer join nest's nested join structure (i.e. get the
+ outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
+ $OJ->embedding->nested_join. Update table_map's of all ancestor nested
+ joins). [MARK2]
+
+* Move the tables and their JOIN_TABs to the front of join order, like it is
+ done with const tables, with exception that if eliminated outer join nest
+ was within another outer join nest, that shouldn't prevent us from moving
+ away the eliminated tables.
+
+* Update join->table_count and all-join-tables bitmap.
+ ^ TODO: not true anymore ^
+
+* That's it. Nothing else?
+
+6. User interface
+=================
+
+6.1 @@optimizer_switch flag
+---------------------------
+Argument against adding the flag:
+* It is always better to perform table elimination than not to do it.
+
+Arguments for the flag:
+* It is always theoretically possible that the new code will cause unintended
+ slowdowns.
+* Having the flag is useful for QA and comparative benchmarking.
+
+Decision so far: add the flag under #ifdef, so that it is present only in
+debug builds.
+
+6.2 EXPLAIN [EXTENDED]
+----------------------
+There are two possible options:
+1. Show eliminated tables, like we do with const tables.
+2. Do not show eliminated tables.
+
+We chose option 2, because:
+- the table is not accessed at all (besides locking it)
+- it is more natural for an anchor model user: when they query an anchor-
+  and-attributes view, they don't care about the unused attributes.
+
+EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
+
+NOTE: Before this WL, the warning text was generated after all JOIN objects
+had been destroyed, which made it impossible to use information about the
+join plan when printing the warning. We've fixed this by keeping the JOIN
+objects until the warning text has been generated.
+
+Table elimination removes the inner side of an outer join, and logically the
+ON clause is also removed. If this clause contains any subqueries, they will
+also be removed from the EXPLAIN output.
+
+An exception to the above: if we eliminate a derived table, it will still be
+shown in EXPLAIN output. This comes from the fact that the FROM subqueries
+are evaluated before table elimination is invoked.
+TODO: Is the above acceptable, or should eliminated parts of FROM subqueries be removed as well?
+
+7. Miscellaneous adjustments
+============================
+
+7.1 Fix used_tables() of aggregate functions
+--------------------------------------------
+Aggregate functions used to report that they depend on all tables, that is,
+
+ item_agg_func->used_tables() == (1ULL << join->tables) - 1
+
+always. We fixed this: an aggregate function now reports that it depends on
+the tables that its arguments depend on. In particular, COUNT(*) reports that
+it depends on no tables (item_count_star->used_tables()==0). One consequence
+is that "item->used_tables()==0" is no longer equivalent to
+"item->const_item()==true" (it is not clear whether this is new behavior or
+was already the case for some items).
+
+7.2 Make subquery predicates collect their outer references
+-----------------------------------------------------------
+Per-column functional dependency analysis requires us to take a
+
+ tbl.field = func(...)
+
+equality and tell which columns of which tables are referred from func(...)
+expression. For scalar expressions, this is accomplished by Item::walk()-based
+traversal. It should be reasonably cheap (the only practical Item that can be
+expensive to traverse seems to be the special case of "col IN (const1,const2,
+...)"; check whether we traverse the long list for such items).
+
+For correlated subqueries, traversal can be expensive, so it is cheaper to
+make each subquery item keep a list of its outer references. The list can be
+collected at the fix_fields() stage with very little extra cost, and then it
+could be used for other optimizations.
+
+
+8. Other concerns
+=================
+
+8.1 Relationship with outer->inner joins converter
+--------------------------------------------------
+One could suspect that outer->inner join conversion could get in the way
+of table elimination by changing outer joins (which could be eliminated)
+to inner (which we will not try to eliminate).
+This concern is not valid: we make outer->inner conversions based on
+predicates in the WHERE clause. If the WHERE clause refers to an inner table
+(this is a requirement for the conversion), then table elimination would not
+be applicable anyway.
+
+8.2 Relationship with prepared statements
+-----------------------------------------
+On one hand, it is natural to want to make table elimination a
+once-per-statement operation, like outer->inner join conversion. We would
+have to limit its applicability by removing [MARK1], as that can change
+during the lifetime of the statement.
+
+The other option is to do table elimination every time. This will require
+reworking operation [MARK2] to be undoable.
+
+
+8.3 Relationship with constant table detection
+----------------------------------------------
+Table elimination is performed after constant table detection (but before
+the range analysis). Constant tables are technically different from
+eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
+Considering we've already done the join_read_const_table() call, is there any
+real difference between a constant table and an eliminated one? If there is,
+should we also mark const tables as eliminated?
+From the user/EXPLAIN point of view: no. A constant table is one that we read
+one record from; an eliminated table is one that we don't access at all.
+TODO
+
+9. Tests and benchmarks
+=======================
+Create a benchmark in sql-bench which checks whether the DBMS supports table
+elimination.
+[According to Monty] Run
+ - query Q1, which would use elimination
+ - query Q2, which is very similar to Q1 (so that they would have the same
+   QEP, execution cost, etc.) but cannot use table elimination.
+Then compare the run times and draw a conclusion about whether the DBMS under
+test supports table elimination.
-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17
-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090 2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090 2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
------------------------------------------------------------
-=-=(View All Progress Notes, 26 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value; the unmatched B.* row would be all NULLs.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, suppose that tableB's rows allow a unique constraint to be placed
on the column B.id, most commonly because B.id is the primary key. In this
situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is in the
lp:~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Elimination criteria
2. No outside references check
2.1 Quick check if there are tables with no outside references
3. One-match check
3.1 Functional dependency source #1: Potential eq_ref access
3.2 Functional dependency source #2: col2=func(col1)
3.3 Functional dependency source #3: One or zero records in the table
3.4 Functional dependency check implementation
3.4.1 Equality collection: Option1
3.4.2 Equality collection: Option2
3.4.3 Functional dependency propagation - option 1
3.4.4 Functional dependency propagation - option 2
4. Removal operation properties
5. Removal operation
6. User interface
6.1 @@optimizer_switch flag
6.2 EXPLAIN [EXTENDED]
7. Miscellaneous adjustments
7.1 Fix used_tables() of aggregate functions
7.2 Make subquery predicates collect their outer references
8. Other concerns
8.1 Relationship with outer->inner joins converter
8.2 Relationship with prepared statements
8.3 Relationship with constant table detection
9. Tests and benchmarks
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Elimination criteria
=======================
We can eliminate the inner side of an outer join nest if:
1. There are no references to columns of the inner tables anywhere else in
the query.
2. For each record combination of outer tables, it will always produce
exactly one matching record combination.
Most of the effort in this WL entry is in checking these two conditions.
2. No outside references check
==============================
Criterion #1 means that the WHERE clause, ON clauses of embedding/subsequent
outer joins, ORDER BY, GROUP BY and HAVING must have no references to inner
tables of the outer join nest we're trying to remove.
For multi-table UPDATE/DELETE we also must not remove tables that we're
updating/deleting from or tables that are used in UPDATE's SET clause.
2.1 Quick check if there are tables with no outside references
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start searching for outer join nests that could be eliminated,
we'll do a quick and cheap check of whether there could possibly be something
to eliminate:
if (there are outer joins &&
(tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_join_tables)
{
attempt table elimination;
}
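As an illustration only, the pre-check above could be sketched in Python with integer bitmaps (the server operates on table_map bitmaps inside the optimizer; the function and argument names here are invented):

```python
def may_have_eliminable_tables(has_outer_joins, select_list_tables,
                               group_order_tables, where_tables,
                               all_join_tables):
    """Cheap pre-check: elimination is only worth attempting if some table
    is referenced neither in the select list, nor in GROUP/ORDER BY, nor
    in WHERE.  Each argument is a bitmap with one bit per join table."""
    used_outside = select_list_tables | group_order_tables | where_tables
    return has_outer_joins and used_outside != all_join_tables

# Three tables (bits 0..2); table 2 is referenced only in its ON clause:
print(may_have_eliminable_tables(True, 0b001, 0b000, 0b010, 0b111))  # True
```

Only when this check passes would the (more expensive) per-nest analysis below run.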
3. One-match check
==================
We can eliminate the inner side of an outer join if it will always generate
exactly one matching record combination.
By definition of OUTER JOIN, a NULL-complemented record combination will be
generated when the inner side of the outer join has not produced any matches.
What remains to be checked is that there is no possibility that the inner
side of the outer join could produce more than one matching record combination.
We'll refer to the one-match property as "functional dependency":
- An outer join nest is functionally dependent [wrt outer tables] if it will
produce one matching record combination per each record combination of
outer tables
- A table is functionally dependent wrt a certain set of dependency tables if
a record combination of the dependency tables uniquely identifies zero or one
matching record in the table
- Definitions of functional dependency of keys (=column tuples) and columns are
apparent.
Our goal is to prove that the entire join nest is functionally-dependent.
A join nest is functionally dependent (on the outside tables) if each of its
elements (each can be either a base table or a join nest) is functionally
dependent.
Functional dependency is transitive: if table A is f-dependent on the outer
tables and table B is f-dependent on {A, outer_tables}, then B is functionally
dependent on the outer tables.
Subsequent sections list cases when we can declare a table to be
functionally-dependent.
3.1 Functional dependency source #1: Potential eq_ref access
------------------------------------------------------------
This is the most practically-important case. Taking the example from the HLD
of this WL entry:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
and generalizing it: a table TBL is functionally-dependent if the ON
expression allows us to construct a potential eq_ref access to table TBL that
uses only outer or functionally-dependent tables.
In other words: table TBL will have one match if the ON expression can be
converted into this form
TBL.unique_key=func(one_match_tables) AND .. remainder ...
(with appropriate extension for multi-part keys), where
one_match_tables= {
tables that are not on the inner side of the outer join in question, and
functionally dependent tables
}
Note that this will cover constant tables, except those that are constant because
they have 0/1 record or are partitioned and have no used partitions.
3.2 Functional dependency source #2: col2=func(col1)
----------------------------------------------------
This comes from the second example in the HLS:
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
Here it is apparent that tableB can be eliminated. It is not possible to
construct eq_ref access to tableB, though, because for the second part of the
primary key (fromDate column) we only got a condition in this form:
B.fromDate= func(tableB)
(we write "func(tableB)" because the ref optimizer can only determine which
tables the right part of the equality depends on).
In the general case, an equality like this doesn't guarantee functional
dependency. For example, if func() == { return fromDate; }, i.e. the ON
expression is
... ON B.id = A.id and B.fromDate = B.fromDate
then that would allow table B to have multiple matches per record of table A.
In order to be able to distinguish between these two cases, we'll need to go
down to column level:
- A table is functionally dependent if it has a unique key that's functionally
dependent
- A unique key is functionally dependent when all of its columns are
functionally dependent
- A table column is functionally dependent if the ON clause allows us to extract
an AND-part in this form:
tbl.column = f(functionally-dependent columns or columns of outer tables)
3.3 Functional dependency source #3: One or zero records in the table
---------------------------------------------------------------------
A table with one or zero records cannot generate more than one matching
record. This source is of lesser importance as one/zero-record tables are only
MyISAM tables.
3.4 Functional dependency check implementation
----------------------------------------------
As shown above, we need something similar to KEYUSE structures, but not
exactly that (we need things that the current ref optimizer considers unusable and
don't need things that it considers usable).
3.4.1 Equality collection: Option1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We could
- extend KEYUSE structures to store all kinds of equalities we need
- change update_ref_and_keys() and co. to collect equalities both for ref
access and for table elimination
= [possibly] Improve [eq_]ref access to be able to use equalities in
form keypart2=func(keypart1)
- process the KEYUSE array both by table elimination and by ref access
optimizer.
+ This requires less effort.
- Code will have to be changed all over sql_select.cc
- update_ref_and_keys() and co. already do several unrelated things. Hooking
up table elimination will make it even worse.
3.4.2 Equality collection: Option2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alternatively, we could process the WHERE clause totally on our own.
+ Table elimination is a standalone, easily detachable module.
- Some code duplication with update_ref_and_keys() and co.
Having collected the equalities, we'll need to propagate the functional
dependency property to unique keys, tables and, ultimately, join nests.
3.4.3 Functional dependency propagation - option 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Borrow the approach used in constant table detection code:
do
{
converted= FALSE;
for each table T in join nest
{
if (check_if_functionally_dependent(T))
converted= TRUE;
}
} while (converted == TRUE);
check_if_functionally_dependent(T)
{
if (T has eq_ref access based on func_dep_tables)
return TRUE;
Apply the same do-while loop-based approach to available equalities
T.column1=func(other columns)
to spread the set of functionally-dependent columns. The goal is to get
all columns of a certain unique key to be bound.
}
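A minimal Python sketch of this fixpoint, under invented data structures (columns modelled as "table.column" strings, equalities mapping a column to the set of columns its binding expression uses — not the server's actual JOIN_TAB/KEYUSE representation):

```python
def find_dependent_tables(inner_tables, unique_keys, columns_of,
                          equalities, outer_cols):
    """Fixpoint propagation of functional dependency.
    unique_keys: table -> list of keys, each key a tuple of column names;
    columns_of:  table -> set of all its used columns;
    equalities:  column -> set of columns its binding expression refers to;
    outer_cols:  columns of outer tables, bound by definition."""
    bound = set(outer_cols)   # columns with at most one value per outer row
    dependent = set()
    changed = True
    while changed:
        changed = False
        # A column becomes bound when it equals func(already-bound columns).
        for col, rhs in equalities.items():
            if col not in bound and rhs <= bound:
                bound.add(col)
                changed = True
        # A table becomes dependent once all parts of one unique key are
        # bound; at most one row then matches, so all its columns are bound.
        for tbl in inner_tables - dependent:
            if any(set(key) <= bound for key in unique_keys.get(tbl, [])):
                dependent.add(tbl)
                bound |= columns_of[tbl]
                changed = True
    return dependent
```

On the HLS example, outer column A.id binds B.id via B.id=A.id, which in turn binds B.fromDate via B.fromDate=func(B.id); the unique key (B.id, B.fromDate) becomes fully bound and B is declared dependent. With B.fromDate=func(B.fromDate) it never would be.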
3.4.4 Functional dependency propagation - option 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Analyze the ON expression(s) and build a list of
tbl.field = expr(...)
equalities. tbl here is a table that belongs to a join nest that could
potentially be eliminated.
Besides those, add to the list:
- An element for each unique key in the table that needs to be eliminated
- An element for each table that needs to be eliminated
- An element for each join nest that can be eliminated (i.e. has no
references from outside).
Then, set up "reverse dependencies": each element should have pointers to
elements that are functionally dependent on it:
- "tbl.field=expr(...)" equality is functionally dependent on all fields that
are used in "expr(...)" (here we take into account only fields that belong
to tables that can potentially be eliminated).
- a unique key is dependent on all of its components
- a table is dependent on all of its unique keys
- a join nest is dependent on all tables that it contains
These pointers are stored in the form of one bitmap, such that:
"X depends on Y" == test( bitmap[(X's number)*n_objects + (Y's number)] )
Each object also stores the number of dependencies that must be satisfied
before it itself is satisfied:
- "tbl.field=expr(...)" needs all its underlying fields (if a field is
referenced many times it is counted only once)
- a unique key needs all of its key parts
- a table needs only one of its unique keys
- a join nest needs all of its tables
(TODO: so what do we do when we've marked a table as constant? We'll need to
update the "field=expr(....)" elements that use fields of that table. And the
problem is that we won't know how much to decrement from the counters of those
elements.
Solution#1: switch to table_map() based approach.
Solution#2: introduce separate elements for each involved field.
field will depend on its table,
"field=expr" will depend on fields.
)
Besides the above, let each element have a pointer to another element, so that
we can have a linked list of elements.
After the above structures have been created, we start the main algorithm.
The first step is to create a list of functionally-dependent elements. We walk
across the array of dependencies and mark those elements that are already bound
(i.e. their dependencies are satisfied). Initially, the only bound elements
are "field=expr" dependencies that don't refer to any columns that are not
bound.
The second step is the loop
while (bound_list is not empty)
{
Take the first bound element F off the list.
Use the bitmap to find out which other elements depend on it
for each such element E
{
if (E becomes bound after F is bound)
add E to the list;
}
}
The last step is to walk through elements that represent the join nests. Those
that are bound can be eliminated.
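The counter-plus-worklist propagation could be sketched as follows (element names and the dict-based reverse-dependency map are illustrative; the design above stores reverse dependencies as one bitmap):

```python
from collections import deque

def propagate_bound(num_deps, reverse_deps, initially_bound):
    """num_deps[e]: how many of e's dependencies must be satisfied before e
    itself is (a table needs only 1: any one of its unique keys suffices).
    reverse_deps[e]: elements that list e among their dependencies."""
    remaining = dict(num_deps)
    bound = set(initially_bound)
    queue = deque(bound)
    while queue:
        f = queue.popleft()                 # take a bound element off the list
        for e in reverse_deps.get(f, ()):   # who depended on it?
            if e not in bound:
                remaining[e] -= 1
                if remaining[e] == 0:       # all needed dependencies satisfied
                    bound.add(e)
                    queue.append(e)
    return bound
```

For the HLS example one could model: eq_id (B.id=A.id, bound immediately), eq_date (B.fromDate=func(B.id), 1 dependency), the unique key (2 parts), table B (needs 1 key) and the join nest (needs its 1 table); propagation then marks the nest bound, i.e. eliminable.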
4. Removal operation properties
===============================
* There is always exactly one way to remove (no choice between removing this
or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
5. Removal operation
====================
(This depends a lot on whether we make table elimination a one-off rewrite or
conditional)
At the moment table elimination is re-done for each join re-execution, hence
the removal operation is designed not to modify any statement's permanent
members.
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front of join order, like it is
done with const tables, with exception that if eliminated outer join nest
was within another outer join nest, that shouldn't prevent us from moving
away the eliminated tables.
* Update join->table_count and all-join-tables bitmap.
^ TODO: not true anymore ^
* That's it. Nothing else?
6. User interface
=================
6.1 @@optimizer_switch flag
---------------------------
Argument against adding the flag:
* It is always better to perform table elimination than not to do it.
Arguments for the flag:
* It is always theoretically possible that the new code will cause unintended
slowdowns.
* Having the flag is useful for QA and comparative benchmarking.
Decision so far: add the flag under #ifdef. Make the flag present in debug
builds.
6.2 EXPLAIN [EXTENDED]
----------------------
There are two possible options:
1. Show eliminated tables, like we do with const tables.
2. Do not show eliminated tables.
We chose option 2, because:
- the table is not accessed at all (besides locking it)
- it is more natural for an anchor model user: when querying an
anchor-and-attributes view, they don't care about the unused attributes.
EXPLAIN EXTENDED+SHOW WARNINGS won't show the removed table either.
NOTE: Before this WL, the warning text was generated after all JOIN objects
have been destroyed. This didn't allow us to use information about the join plan
when printing the warning. We've fixed this by keeping the JOIN objects until
the warning text has been generated.
Table elimination removes the inner sides of outer joins, and logically the ON
clause is also removed. If this clause has any subqueries, they will also be
removed from the EXPLAIN output.
An exception to the above is that if we eliminate a derived table, it will
still be shown in EXPLAIN output. This comes from the fact that the FROM
subqueries are evaluated before table elimination is invoked.
TODO: Is the above OK, or should we still remove parts of FROM subqueries?
7. Miscellaneous adjustments
============================
7.1 Fix used_tables() of aggregate functions
--------------------------------------------
Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it, now aggregate function reports that it depends on the
tables that its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0). One consequence of
that is that "item->used_tables()==0" is not equivalent to
"item->const_item()==true" anymore (not sure if it's "anymore" or this has
been already so for some items).
7.2 Make subquery predicates collect their outer references
-----------------------------------------------------------
Per-column functional dependency analysis requires us to take a
tbl.field = func(...)
equality and tell which columns of which tables are referred from func(...)
expression. For scalar expressions, this is accomplished by Item::walk()-based
traversal. It should be reasonably cheap (the only practical Item that can be
expensive to traverse seems to be the special case of "col IN (const1, const2,
...)"; check whether we traverse the long list for such items).
For correlated subqueries, traversal can be expensive; it is cheaper to make
each subquery item keep a list of its outer references. The list can be
collected at fix_fields() stage with very little extra cost, and then it could
be used for other optimizations.
8. Other concerns
=================
8.1 Relationship with outer->inner joins converter
--------------------------------------------------
One could suspect that outer->inner join conversion could get in the way
of table elimination by changing outer joins (which could be eliminated)
to inner (which we will not try to eliminate).
This concern is not valid: we make outer->inner conversions based on
predicates in WHERE. If the WHERE referred to an inner table (this is a
requirement for the conversion) then table elimination would not be
applicable anyway.
8.2 Relationship with prepared statements
-----------------------------------------
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require to
rework operation [MARK2] to be undoable.
8.3 Relationship with constant table detection
----------------------------------------------
Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we read
one record from; an eliminated table is one that we don't access at all.
TODO
9. Tests and benchmarks
=======================
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- query Q1 that would use elimination
- query Q2 that is very similar to Q1 (so that they would have the same
QEP, execution cost, etc.) but cannot use table elimination.
Then compare run times and draw a conclusion about whether the DBMS under test
supports table elimination.
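The final comparison could look something like this sketch (the 2x threshold is an arbitrary illustrative choice, not from the worklog):

```python
from statistics import median

def concludes_elimination(q1_times, q2_times, ratio=2.0):
    """q1_times: run times of queries whose join could be eliminated;
    q2_times: run times of near-identical queries where it could not.
    If Q1 is markedly faster despite the identical QEP and cost, the DBMS
    presumably eliminated the join."""
    return median(q2_times) / median(q1_times) >= ratio
```

Using medians of repeated runs keeps one slow outlier (cold cache, background load) from flipping the verdict.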
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Rev 2820: Apply Evgen's fix: in file:///home/psergey/dev/mysql-next-fix-subq-r2/
by Sergey Petrunya 29 Jul '09
At file:///home/psergey/dev/mysql-next-fix-subq-r2/
------------------------------------------------------------
revno: 2820
revision-id: psergey(a)askmonty.org-20090729161849-ynumr03ety244ueu
parent: psergey(a)askmonty.org-20090708174703-dz9uf5b0m6pcvtl6
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-fix-subq-r2
timestamp: Wed 2009-07-29 20:18:49 +0400
message:
Apply Evgen's fix:
Bug#45174: Incorrectly applied equality propagation caused wrong result
on a query with a materialized semi-join.
Equality propagation is done after query execution plan is chosen. It
substitutes fields from tables being retrieved later for fields from tables
being retrieved earlier. Materialized semi-joins are exception to this rule.
For field which belongs to a table within a materialized semi-join, we can
only pick fields from the same semi-join.
Example: suppose we have a join order:
ot1 ot2 SJ-Mat(it1 it2 it3) ot3
and equality ot2.col = it1.col = it2.col
If we're looking for best substitute for 'it2.col', we should pick it1.col
and not ot2.col.
For a field that is not in a materialized semi-join we must pick a field
that's not embedded in a materialized semi-join.
Example: suppose we have a join order:
SJ-Mat(it1 it2) ot1 ot2
and equality ot2.col = ot1.col = it2.col
If we're looking for best substitute for 'ot2.col', we should pick ot1.col
and not it2.col, because when we run a join between ot1 and ot2
execution of SJ-Mat(...) has already finished and we can't rely on the value
of it*.*.
Now the Item_equal::get_first function accepts as a parameter a field being
substituted and checks whether it belongs to a materialized semi-join.
Depending on the check result a field to substitute for or NULL is returned.
The is_sj_materialization_strategy method is added to the JOIN_TAB class to
check whether JOIN_TAB belongs to a materialized semi-join.
=== modified file 'mysql-test/r/subselect3.result'
--- a/mysql-test/r/subselect3.result 2009-04-30 19:37:21 +0000
+++ b/mysql-test/r/subselect3.result 2009-07-29 16:18:49 +0000
@@ -1081,8 +1081,8 @@
insert into t3 select A.a + 10*B.a, 'filler' from t0 A, t0 B;
explain select * from t3 where a in (select a from t2) and (a > 5 or a < 10);
id select_type table type possible_keys key key_len ref rows Extra
-1 PRIMARY t2 ALL NULL NULL NULL NULL 2 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t2.a 1
+1 PRIMARY t2 ALL NULL NULL NULL NULL 2 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t2.a 1 Using index condition
select * from t3 where a in (select a from t2);
a filler
1 filler
@@ -1129,8 +1129,8 @@
explain select * from t1, t3 where t3.a in (select a from t2) and (t3.a < 10 or t3.a >30) and t1.a =3;
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 Using where
-1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t2.a 10
+1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t2.a 10 Using index condition
explain select straight_join * from t1 A, t1 B where A.a in (select a from t2);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY A ALL NULL NULL NULL NULL 10 Using where
@@ -1158,14 +1158,14 @@
explain select * from t0, t3 where t3.a in (select a from t2) and (t3.a < 10 or t3.a >30);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t0 system NULL NULL NULL NULL 1
-1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t2.a 10
+1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t2.a 10 Using index condition
create table t4 as select a as x, a as y from t1;
explain select * from t0, t3 where (t3.a, t3.b) in (select x,y from t4) and (t3.a < 10 or t3.a >30);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t0 system NULL NULL NULL NULL 1
-1 PRIMARY t4 ALL NULL NULL NULL NULL 10 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t4.x 10 Using where
+1 PRIMARY t4 ALL NULL NULL NULL NULL 10 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t4.x 10 Using index condition; Using where
drop table t0,t1,t2,t3,t4;
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
=== modified file 'mysql-test/r/subselect3_jcl6.result'
--- a/mysql-test/r/subselect3_jcl6.result 2009-04-30 19:37:21 +0000
+++ b/mysql-test/r/subselect3_jcl6.result 2009-07-29 16:18:49 +0000
@@ -1086,8 +1086,8 @@
insert into t3 select A.a + 10*B.a, 'filler' from t0 A, t0 B;
explain select * from t3 where a in (select a from t2) and (a > 5 or a < 10);
id select_type table type possible_keys key key_len ref rows Extra
-1 PRIMARY t2 ALL NULL NULL NULL NULL 2 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t2.a 1 Using join buffer
+1 PRIMARY t2 ALL NULL NULL NULL NULL 2 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t2.a 1 Using index condition; Using join buffer
select * from t3 where a in (select a from t2);
a filler
1 filler
@@ -1134,8 +1134,8 @@
explain select * from t1, t3 where t3.a in (select a from t2) and (t3.a < 10 or t3.a >30) and t1.a =3;
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t1 ALL NULL NULL NULL NULL 10 Using where
-1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t2.a 10 Using join buffer
+1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t2.a 10 Using index condition; Using join buffer
explain select straight_join * from t1 A, t1 B where A.a in (select a from t2);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY A ALL NULL NULL NULL NULL 10 Using where
@@ -1163,14 +1163,14 @@
explain select * from t0, t3 where t3.a in (select a from t2) and (t3.a < 10 or t3.a >30);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t0 system NULL NULL NULL NULL 1
-1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t2.a 10 Using join buffer
+1 PRIMARY t2 ALL NULL NULL NULL NULL 10 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t2.a 10 Using index condition; Using join buffer
create table t4 as select a as x, a as y from t1;
explain select * from t0, t3 where (t3.a, t3.b) in (select x,y from t4) and (t3.a < 10 or t3.a >30);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t0 system NULL NULL NULL NULL 1
-1 PRIMARY t4 ALL NULL NULL NULL NULL 10 Using where; Materialize; Scan
-1 PRIMARY t3 ref a a 5 test.t4.x 10 Using where; Using join buffer
+1 PRIMARY t4 ALL NULL NULL NULL NULL 10 Materialize; Scan
+1 PRIMARY t3 ref a a 5 test.t4.x 10 Using index condition; Using where; Using join buffer
drop table t0,t1,t2,t3,t4;
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2009-07-06 07:57:39 +0000
+++ b/mysql-test/r/subselect_sj.result 2009-07-29 16:18:49 +0000
@@ -372,3 +372,39 @@
3
2
drop table t1, t2, t3;
+#
+# Bug#45174: Incorrectly applied equality propagation caused wrong
+# result on a query with a materialized semi-join.
+#
+CREATE TABLE `CC` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`varchar_key` varchar(1) NOT NULL,
+`varchar_nokey` varchar(1) NOT NULL,
+PRIMARY KEY (`pk`),
+KEY `varchar_key` (`varchar_key`)
+);
+INSERT INTO `CC` VALUES (11,'m','m'),(12,'j','j'),(13,'z','z'),(14,'a','a'),(15,'',''),(16,'e','e'),(17,'t','t'),(19,'b','b'),(20,'w','w'),(21,'m','m'),(23,'',''),(24,'w','w'),(26,'e','e'),(27,'e','e'),(28,'p','p');
+CREATE TABLE `C` (
+`varchar_nokey` varchar(1) NOT NULL
+);
+INSERT INTO `C` VALUES ('v'),('u'),('n'),('l'),('h'),('u'),('n'),('j'),('k'),('e'),('i'),('u'),('n'),('b'),('x'),(''),('q'),('u');
+EXPLAIN EXTENDED SELECT varchar_nokey
+FROM C
+WHERE ( `varchar_nokey` , `varchar_nokey` ) IN (
+SELECT `varchar_key` , `varchar_nokey`
+FROM CC
+WHERE `varchar_nokey` < 'n' XOR `pk` ) ;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY C ALL NULL NULL NULL NULL 18 100.00
+1 PRIMARY CC ALL varchar_key NULL NULL NULL 15 100.00 Using where; Materialize
+Warnings:
+Note 1003 select `test`.`C`.`varchar_nokey` AS `varchar_nokey` from `test`.`C` semi join (`test`.`CC`) where ((`test`.`CC`.`varchar_key` = `test`.`C`.`varchar_nokey`) and (`test`.`CC`.`varchar_nokey` = `test`.`CC`.`varchar_key`) and ((`test`.`CC`.`varchar_nokey` < 'n') xor `test`.`CC`.`pk`))
+SELECT varchar_nokey
+FROM C
+WHERE ( `varchar_nokey` , `varchar_nokey` ) IN (
+SELECT `varchar_key` , `varchar_nokey`
+FROM CC
+WHERE `varchar_nokey` < 'n' XOR `pk` ) ;
+varchar_nokey
+DROP TABLE CC, C;
+# End of the test for bug#45174.
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2009-07-06 07:57:39 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2009-07-29 16:18:49 +0000
@@ -376,6 +376,42 @@
3
2
drop table t1, t2, t3;
+#
+# Bug#45174: Incorrectly applied equality propagation caused wrong
+# result on a query with a materialized semi-join.
+#
+CREATE TABLE `CC` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`varchar_key` varchar(1) NOT NULL,
+`varchar_nokey` varchar(1) NOT NULL,
+PRIMARY KEY (`pk`),
+KEY `varchar_key` (`varchar_key`)
+);
+INSERT INTO `CC` VALUES (11,'m','m'),(12,'j','j'),(13,'z','z'),(14,'a','a'),(15,'',''),(16,'e','e'),(17,'t','t'),(19,'b','b'),(20,'w','w'),(21,'m','m'),(23,'',''),(24,'w','w'),(26,'e','e'),(27,'e','e'),(28,'p','p');
+CREATE TABLE `C` (
+`varchar_nokey` varchar(1) NOT NULL
+);
+INSERT INTO `C` VALUES ('v'),('u'),('n'),('l'),('h'),('u'),('n'),('j'),('k'),('e'),('i'),('u'),('n'),('b'),('x'),(''),('q'),('u');
+EXPLAIN EXTENDED SELECT varchar_nokey
+FROM C
+WHERE ( `varchar_nokey` , `varchar_nokey` ) IN (
+SELECT `varchar_key` , `varchar_nokey`
+FROM CC
+WHERE `varchar_nokey` < 'n' XOR `pk` ) ;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY C ALL NULL NULL NULL NULL 18 100.00
+1 PRIMARY CC ALL varchar_key NULL NULL NULL 15 100.00 Using where; Materialize
+Warnings:
+Note 1003 select `test`.`C`.`varchar_nokey` AS `varchar_nokey` from `test`.`C` semi join (`test`.`CC`) where ((`test`.`CC`.`varchar_key` = `test`.`C`.`varchar_nokey`) and (`test`.`CC`.`varchar_nokey` = `test`.`CC`.`varchar_key`) and ((`test`.`CC`.`varchar_nokey` < 'n') xor `test`.`CC`.`pk`))
+SELECT varchar_nokey
+FROM C
+WHERE ( `varchar_nokey` , `varchar_nokey` ) IN (
+SELECT `varchar_key` , `varchar_nokey`
+FROM CC
+WHERE `varchar_nokey` < 'n' XOR `pk` ) ;
+varchar_nokey
+DROP TABLE CC, C;
+# End of the test for bug#45174.
set join_cache_level=default;
show variables like 'join_cache_level';
Variable_name Value
=== modified file 'mysql-test/t/subselect_sj.test'
--- a/mysql-test/t/subselect_sj.test 2009-07-06 07:57:39 +0000
+++ b/mysql-test/t/subselect_sj.test 2009-07-29 16:18:49 +0000
@@ -22,7 +22,6 @@
create table t12 like t10;
insert into t12 select * from t10;
-
--echo Flattened because of dependency, t10=func(t1)
explain select * from t1 where a in (select pk from t10);
select * from t1 where a in (select pk from t10);
@@ -252,3 +251,43 @@
where a in (select c from t2 where d >= some(select e from t3 where b=e));
drop table t1, t2, t3;
+
+--echo #
+--echo # Bug#45174: Incorrectly applied equality propagation caused wrong
+--echo # result on a query with a materialized semi-join.
+--echo #
+
+CREATE TABLE `CC` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `varchar_key` varchar(1) NOT NULL,
+ `varchar_nokey` varchar(1) NOT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `varchar_key` (`varchar_key`)
+);
+
+INSERT INTO `CC` VALUES (11,'m','m'),(12,'j','j'),(13,'z','z'),(14,'a','a'),(15,'',''),(16,'e','e'),(17,'t','t'),(19,'b','b'),(20,'w','w'),(21,'m','m'),(23,'',''),(24,'w','w'),(26,'e','e'),(27,'e','e'),(28,'p','p');
+
+CREATE TABLE `C` (
+ `varchar_nokey` varchar(1) NOT NULL
+);
+
+INSERT INTO `C` VALUES ('v'),('u'),('n'),('l'),('h'),('u'),('n'),('j'),('k'),('e'),('i'),('u'),('n'),('b'),('x'),(''),('q'),('u');
+
+EXPLAIN EXTENDED SELECT varchar_nokey
+FROM C
+WHERE ( `varchar_nokey` , `varchar_nokey` ) IN (
+SELECT `varchar_key` , `varchar_nokey`
+FROM CC
+WHERE `varchar_nokey` < 'n' XOR `pk` ) ;
+
+SELECT varchar_nokey
+FROM C
+WHERE ( `varchar_nokey` , `varchar_nokey` ) IN (
+SELECT `varchar_key` , `varchar_nokey`
+FROM CC
+WHERE `varchar_nokey` < 'n' XOR `pk` ) ;
+
+DROP TABLE CC, C;
+
+--echo # End of the test for bug#45174.
+
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2009-07-06 07:57:39 +0000
+++ b/sql/item.cc 2009-07-29 16:18:49 +0000
@@ -4895,7 +4895,7 @@
return this;
return const_item;
}
- Item_field *subst= item_equal->get_first();
+ Item_field *subst= item_equal->get_first(this);
if (subst && field->table != subst->field->table && !field->eq(subst->field))
return subst;
}
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2009-07-06 07:57:39 +0000
+++ b/sql/item_cmpfunc.cc 2009-07-29 16:18:49 +0000
@@ -5377,7 +5377,7 @@
void Item_equal::fix_length_and_dec()
{
- Item *item= get_first();
+ Item *item= get_first(NULL);
eval_item= cmp_item::get_comparator(item->result_type(),
item->collation.collation);
}
@@ -5440,3 +5440,107 @@
str->append(')');
}
+
+/*
+  @brief Get the first field of a multiple equality.
+  @param[in] field the field to find an equal field for
+
+  @details Get the first field of the multiple equality that is equal to the
+  given field. In order to make the semi-join materialization strategy work
+  correctly we can't propagate equal fields from an upper select into the
+  semi-join. Thus the field is returned according to the following rules:
+
+  1) If the given field belongs to a semi-join, then the first field in the
+  multiple equality which belongs to the same semi-join is returned.
+  Otherwise NULL is returned.
+  2) If no field is given or the field doesn't belong to a semi-join, then
+  the first field in the multiple equality is returned.
+
+  @retval Found first field of the multiple equality.
+  @retval 0 if no field was found.
+*/
+
+Item_field* Item_equal::get_first(Item_field *field)
+{
+ List_iterator<Item_field> it(fields);
+ Item_field *item;
+ JOIN_TAB *field_tab;
+
+ if (!field)
+ return fields.head();
+ /*
+ Of all equal fields, return the first one we can use. Normally, this is the
+ field which belongs to the table that is the first in the join order.
+
+ There is one exception to this: When semi-join materialization strategy is
+ used, and the given field belongs to a table within the semi-join nest, we
+ must pick the first field in the semi-join nest.
+
+ Example: suppose we have a join order:
+
+ ot1 ot2 SJ-Mat(it1 it2 it3) ot3
+
+ and equality ot2.col = it1.col = it2.col
+ If we're looking for best substitute for 'it2.col', we should pick it1.col
+ and not ot2.col.
+ */
+
+ field_tab= field->field->table->reginfo.join_tab;
+ if (field_tab->is_sj_materialization_strategy())
+ {
+ /*
+    It's a field from a materialized semi-join. We can substitute it only
+ for a field from the same semi-join.
+ */
+ JOIN_TAB *first;
+ JOIN *join= field_tab->join;
+ uint tab_idx= field_tab - field_tab->join->join_tab;
+ /* Find first table of this semi-join. */
+ for (int i=tab_idx; i >= join->const_tables; i--)
+ {
+ if (join->best_positions[i].sj_strategy == SJ_OPT_MATERIALIZE ||
+ join->best_positions[i].sj_strategy == SJ_OPT_MATERIALIZE_SCAN)
+ first= join->join_tab + i;
+ else
+ // Found first tab that doesn't belong to current SJ.
+ break;
+ }
+ /* Find an item to substitute for. */
+ while ((item= it++))
+ {
+ if (item->field->table->reginfo.join_tab >= first)
+ {
+ /*
+ If we found given field then return NULL to avoid unnecessary
+ substitution.
+ */
+ return (item != field) ? item : NULL;
+ }
+ }
+ }
+ else
+ {
+ /*
+    The field is not in an SJ-Materialization nest. We must return the first
+ that's not embedded in a SJ-Materialization nest.
+ Example: suppose we have a join order:
+
+ SJ-Mat(it1 it2) ot1 ot2
+
+ and equality ot2.col = ot1.col = it2.col
+ If we're looking for best substitute for 'ot2.col', we should pick ot1.col
+ and not it2.col, because when we run a join between ot1 and ot2
+ execution of SJ-Mat(...) has already finished and we can't rely on the
+ value of it*.*.
+ */
+ while ((item= it++))
+ {
+ field_tab= item->field->table->reginfo.join_tab;
+ if (!field_tab->is_sj_materialization_strategy())
+ return item;
+ }
+ }
+ // Shouldn't get here.
+ DBUG_ASSERT(0);
+ return NULL;
+}
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2009-07-06 07:57:39 +0000
+++ b/sql/item_cmpfunc.h 2009-07-29 16:18:49 +0000
@@ -1593,7 +1593,7 @@
void add(Item_field *f);
uint members();
bool contains(Field *field);
- Item_field* get_first() { return fields.head(); }
+ Item_field* get_first(Item_field *field);
void merge(Item_equal *item);
void update_const();
enum Functype functype() const { return MULT_EQUAL_FUNC; }
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-07-06 14:33:29 +0000
+++ b/sql/sql_select.cc 2009-07-29 16:18:49 +0000
@@ -10379,6 +10379,21 @@
/**
+ Check whether the JOIN_TAB belongs to a materialized semi-join.
+*/
+
+bool JOIN_TAB::is_sj_materialization_strategy()
+{
+ uint tab_idx= this - join->join_tab;
+
+ return (emb_sj_nest &&
+ ((join->best_positions[tab_idx].sj_strategy == SJ_OPT_MATERIALIZE ||
+ join->best_positions[tab_idx].sj_strategy == SJ_OPT_MATERIALIZE_SCAN)));
+
+}
+
+
+/**
Partially cleanup JOIN after it has executed: close index or rnd read
(table cursors), free quick selects.
@@ -11720,7 +11735,7 @@
head= item_const;
else
{
- head= item_equal->get_first();
+ head= item_equal->get_first(NULL);
it++;
}
Item_field *item_field;
=== modified file 'sql/sql_select.h'
--- a/sql/sql_select.h 2009-05-07 20:48:24 +0000
+++ b/sql/sql_select.h 2009-07-29 16:18:49 +0000
@@ -332,6 +332,7 @@
return first_inner;
return first_sj_inner_tab;
}
+ bool is_sj_materialization_strategy();
} JOIN_TAB;
/*
Hi all,
Last week I pushed a new buildbot report live and, after a few tweaks, wanted to send
out an email about it.
The report is located here:
http://askmonty.org/buildbot/reports/
At the moment we only have the "cross reference" report; more reports will be added in
the future. The cross reference report lists all test failures matching your search (to
view all data, just click "search" with no parameters). You can click on a row to
display the full text of a test failure.
Currently only a few test failures are listed, but all future failures will appear on
this report.
If you have any questions or comments, feel free to email the list or me directly, or
hop on #maria.
Best Regards,
--
Bryan Alsdorf, Lead Web Developer
Monty Program, AB. http://askmonty.org
[Maria-developers] Rev 2717: WL#4800: Optimizer trace in file:///home/psergey/dev/maria-5.1-opt-trace/
by Sergey Petrunya 23 Jul '09
At file:///home/psergey/dev/maria-5.1-opt-trace/
------------------------------------------------------------
revno: 2717
revision-id: psergey(a)askmonty.org-20090723174522-99j6if4ay9r341qg
parent: knielsen(a)knielsen-hq.org-20090707111924-e44ycwmckomk13qz
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: maria-5.1-opt-trace
timestamp: Thu 2009-07-23 21:45:22 +0400
message:
WL#4800: Optimizer trace
- Port current state to MariaDB
Diff too large for email (1909 lines, the limit is 1000).
[Maria-developers] Rev 2717: WL#4800: Optimizer trace in file:///home/psergey/dev/maria-5.1-opt-trace/
by Sergey Petrunya 23 Jul '09
At file:///home/psergey/dev/maria-5.1-opt-trace/
------------------------------------------------------------
revno: 2717
revision-id: psergey(a)askmonty.org-20090723174047-982pmyty704c5bgu
parent: knielsen(a)knielsen-hq.org-20090707111924-e44ycwmckomk13qz
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: maria-5.1-opt-trace
timestamp: Thu 2009-07-23 21:40:47 +0400
message:
WL#4800: Optimizer trace
- Port of current state to mariadb-5.1
Diff too large for email (1354 lines, the limit is 1000).
[Maria-developers] Updated (by Guest): Table elimination: all tasks (29)
by worklog-noreply@askmonty.org 23 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination: all tasks
CREATION DATE..: Wed, 03 Jun 2009, 12:07
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 29 (http://askmonty.org/worklog/?tid=29)
VERSION........: Server-5.1
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Thu, 23 Jul 2009, 20:13)=-=-
Version updated.
--- /tmp/wklog.29.old.17550 2009-07-23 20:13:44.000000000 +0300
+++ /tmp/wklog.29.new.17550 2009-07-23 20:13:44.000000000 +0300
@@ -1 +1 @@
-Server-4.0
+Server-5.1
-=-=(Guest - Thu, 23 Jul 2009, 20:09)=-=-
Version updated.
--- /tmp/wklog.29.old.17326 2009-07-23 20:09:38.000000000 +0300
+++ /tmp/wklog.29.new.17326 2009-07-23 20:09:38.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-4.0
-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17
-=-=(Guest - Tue, 16 Jun 2009, 17:03)=-=-
Dependency deleted: 29 no longer depends on 20
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Psergey - Wed, 03 Jun 2009, 12:07)=-=-
Dependency created: 29 now depends on 20
-=-=(Psergey - Wed, 03 Jun 2009, 12:07)=-=-
Dependency created: 29 now depends on 17
DESCRIPTION:
This WL entry groups all table elimination tasks.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Guest): Table elimination: all tasks (29)
by worklog-noreply@askmonty.org 23 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination: all tasks
CREATION DATE..: Wed, 03 Jun 2009, 12:07
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 29 (https://askmonty.org/worklog/?tid=29)
VERSION........: Server-4.0
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Thu, 23 Jul 2009, 20:09)=-=-
Version updated.
--- /tmp/wklog.29.old.17326 2009-07-23 20:09:38.000000000 +0300
+++ /tmp/wklog.29.new.17326 2009-07-23 20:09:38.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-4.0
-=-=(Guest - Thu, 23 Jul 2009, 20:07)=-=-
Dependency created: 29 now depends on 17
-=-=(Guest - Tue, 16 Jun 2009, 17:03)=-=-
Dependency deleted: 29 no longer depends on 20
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Psergey - Wed, 03 Jun 2009, 12:07)=-=-
Dependency created: 29 now depends on 20
-=-=(Psergey - Wed, 03 Jun 2009, 12:07)=-=-
Dependency created: 29 now depends on 17
DESCRIPTION:
This WL entry groups all table elimination tasks.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Monty): Table elimination (17)
by worklog-noreply@askmonty.org 23 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-5.1
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090 2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090 2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't access at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value; the unmatched B.* row would be all NULLs.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example (as is often
the case) a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
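The reasoning above can be sketched in plain Python (a minimal, illustrative model: dicts stand in for tables, the data and function names are made up, and a dict keyed by id stands in for the unique constraint on B.id):

```python
# When B.id is unique, "A LEFT JOIN B ON B.id = A.id" yields exactly one
# output row per row of A, so if no B columns are selected the result is
# just A's columns -- the join can be skipped entirely.

def left_join_rows(a_rows, b_index):
    """LEFT JOIN tableA A ON B.id = A.id, with B keyed by its unique id."""
    out = []
    for a in a_rows:
        b = b_index.get(a["id"])   # unique key: at most one matching row
        out.append({"colA": a["colA"], "b": b})  # b is None when unmatched
    return out

a_rows = [{"id": 1, "colA": "x"}, {"id": 2, "colA": "y"}]
b_index = {1: {"id": 1}}           # unique constraint: one row per id

joined = left_join_rows(a_rows, b_index)
eliminated = [{"colA": a["colA"]} for a in a_rows]  # never touch B

# Same row count, identical A.colA values:
assert [r["colA"] for r in joined] == [r["colA"] for r in eliminated]
```

If B.id were not unique, `b_index` could hold several rows per id and the joined result could grow, which is exactly why the unique constraint is a precondition.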
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have
a denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest either:
- is a constant table:
= because it can be accessed via eq_ref(const) access, or
= because it is a zero-rows or one-row MyISAM-like table [MARK1]
- has an eq_ref access method candidate.
#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
GROUP BY and HAVING do not refer to the inner tables of the outer join
nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
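The quick check above can be sketched with integer bitmaps, bit i standing for table i (the function name and signature here are illustrative, not the server's actual identifiers):

```python
# If some table's bit is set in bitmap_of_all_tables but in none of the
# "used" bitmaps, that table is referenced nowhere outside its join nest
# and is a candidate for elimination.

def may_have_elimination_candidates(select_list_tables: int,
                                    group_order_tables: int,
                                    where_tables: int,
                                    bitmap_of_all_tables: int) -> bool:
    used = select_list_tables | group_order_tables | where_tables
    return used != bitmap_of_all_tables  # some table is never referenced

# Three tables (bits 0..2). Only tables 0 and 1 are referenced anywhere,
# so table 2 is a candidate:
assert may_have_elimination_candidates(0b001, 0b000, 0b010, 0b111) is True
# All three tables referenced somewhere -- nothing to attempt:
assert may_have_elimination_candidates(0b011, 0b100, 0b010, 0b111) is False
```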
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front, as is done with const
tables, with the exception that if the eliminated outer join nest was
within another outer join nest, that shouldn't prevent us from moving
the eliminated tables away.
* Update join->table_count and all-join-tables bitmap.
* That's it. Nothing else?
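A toy sketch of the first step, unlinking the nest and updating ancestor table maps (the `Nest` class below is a stand-in for the server's TABLE_LIST/nested_join structures, not the real thing):

```python
# Remove an eliminated outer join nest from its embedding join and clear
# its tables from the table_map of every ancestor nest.

class Nest:
    def __init__(self, name, table_map, embedding=None):
        self.name = name
        self.table_map = table_map    # bitmap of tables inside this nest
        self.embedding = embedding    # parent nest, or None at top level
        self.children = []
        if embedding is not None:
            embedding.children.append(self)

def remove_nest(oj: Nest) -> None:
    parent = oj.embedding
    parent.children.remove(oj)        # drop from the nested-join list
    anc = parent
    while anc is not None:            # walk up, fixing ancestor bitmaps
        anc.table_map &= ~oj.table_map
        anc = anc.embedding

top = Nest("top", 0b111)
oj = Nest("eliminated_oj", 0b100, embedding=top)
remove_nest(oj)
assert top.table_map == 0b011 and oj not in top.children
```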
4. User interface
-----------------
* We'll add an @@optimizer switch flag for table elimination. Tentative
name: 'table_elimination'.
(Note: the utility of the above is questioned, as table elimination can
never be worse than no elimination. We're leaning towards not adding the flag.)
* EXPLAIN will not show the removed tables at all. This will make it
possible to check whether tables were removed, and will also behave nicely
with the anchor model and VIEWs: stuff that the user doesn't care about
just won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to the ones above (so that they would have
the same QEP, execution cost, etc.) but cannot use table elimination,
then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require
reworking operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables of which some are 1-row
MyISAM tables that do not have a primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed: now an aggregate function reports that it depends on the
tables its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0).
One consequence is that "item->used_tables()==0" is no longer equivalent
to "item->const_item()==true" (not sure whether this is new or was
already the case).
* The EXPLAIN EXTENDED warning text was generated after the JOIN object had
been discarded, which made it impossible to use information about the join
plan when printing the warning. Fixed by keeping the JOIN objects until
the warning has been printed (we also intend to remove the const tables
from the join output).
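The fixed aggregate used_tables() behaviour described above amounts to OR-ing the argument maps (sketch only; `agg_used_tables` is an illustrative name, not the actual Item method):

```python
# An aggregate's table map is the union (bitwise OR) of its arguments'
# maps. COUNT(*) has no arguments, so it reports an empty map.

from functools import reduce

def agg_used_tables(arg_maps):
    return reduce(lambda a, b: a | b, arg_maps, 0)

assert agg_used_tables([]) == 0                  # COUNT(*): no tables
assert agg_used_tables([0b001, 0b100]) == 0b101  # e.g. SUM(t1.a + t3.b)
# Note: a map of 0 no longer implies const_item() == true.
```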
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let EXPLAIN code check if the
SELECT was eliminated before printing (EXPLAIN is generated by doing
a recursive descent, so the check will also cause children of eliminated
selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we
read one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join:
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because condition "B.fromDate= func(tableB)" cannot be used.
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access)
so they are not put into KEYUSE array.
Reason#2: even if they were put there, we would need to be able to
distinguish between predicates like
B.fromDate= func(B.id) // guarantees only one matching row as
// B.id is already bound by B.id=A.id
// hence B.fromDate becomes bound too.
and
"B.fromDate= func(B.*)" // Can potentially have many matching
// records.
We need to
- Have update_ref_and_keys() create KEYUSE elements for such equalities
- Have eliminate_tables() and friends make a more accurate check.
The right check is whether all parts of a unique key are bound.
If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
keypartY to be bound.
The difficulty here is that a correlated subquery predicate cannot tell
which columns it depends on (it only remembers tables).
Traversing the predicate is expensive and complicated.
We're leaning towards making each subquery predicate keep a List<Item> of
items that
- are in the current select,
- and that it depends on.
This list will be useful in certain other subquery optimizations as well,
it is cheap to collect it in fix_fields() phase, so it will be collected
for every subquery predicate.
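The "all unique key parts bound" check with functional dependencies can be sketched as a fixed-point iteration (everything here is illustrative: the function name, the string column names, and the representation of "t.keypartY = func(bound columns)" as (column, set-of-needed-columns) pairs are assumptions, not server code):

```python
# A keypart becomes bound if it is equal to a function of columns that are
# already bound; iterate until no more keyparts can be bound, then check
# whether the whole unique key is covered.

def unique_key_fully_bound(key_parts, initially_bound, dependencies):
    """dependencies: list of (column, set of columns it is a function of)."""
    bound = set(initially_bound)
    changed = True
    while changed:
        changed = False
        for col, needs in dependencies:
            if col not in bound and needs <= bound:
                bound.add(col)    # t.col = func(already-bound cols)
                changed = True
    return set(key_parts) <= bound

# Unique key (id, fromDate); id is bound by B.id = A.id, and the
# correlated MAX() subquery gives fromDate = func(id):
assert unique_key_fully_bound(["id", "fromDate"], ["id"],
                              [("fromDate", {"id"})]) is True
# But fromDate = func(B.*) -- depending on unbound columns -- binds nothing:
assert unique_key_fully_bound(["id", "fromDate"], ["id"],
                              [("fromDate", {"id", "other"})]) is False
```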
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
[Maria-developers] Updated (by Monty): Table elimination (17)
by worklog-noreply@askmonty.org 23 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-5.1
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Monty - Thu, 23 Jul 2009, 09:19)=-=-
Version updated.
--- /tmp/wklog.17.old.24090 2009-07-23 09:19:32.000000000 +0300
+++ /tmp/wklog.17.new.24090 2009-07-23 09:19:32.000000000 +0300
@@ -1 +1 @@
-Server-9.x
+Server-5.1
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
small dent gone ("deukje weg")
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Guest - Wed, 10 Jun 2009, 01:23)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.1842 2009-06-10 01:23:42.000000000 +0300
+++ /tmp/wklog.17.new.1842 2009-06-10 01:23:42.000000000 +0300
@@ -131,6 +131,11 @@
- this is what happens for constant tables, too.
- I don't see how showing them could be of any use. They only make it
harder to read the rewritten query.
+ It turns out that
+ - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
+ lifetime) changes.
+ - it is hard to have it show per-execution data. This is because the warning
+ text is generated after the execution structures have been destroyed.
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
------------------------------------------------------------
-=-=(View All Progress Notes, 24 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate not needed tables from SELECT queries..
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met the selected column (A.colA) will
still contain it's original value. The not seen B.* row would contain all NULL:s.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example and often the
case a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations. Either when you have
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest is:
- is a constant table:
= because it can be accessed via eq_ref(const) access, or
= it is a zero-rows or one-row MyISAM-like table [MARK1]
- has an eq_ref access method candidate.
#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
GROUP BY and HAVING do not refer to the inner tables of the outer join
nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as much tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to front like it is done with const
tables, with exception that if eliminated outer join nest was within
another outer join nest, that shouldn't prevent us from moving away the
eliminated tables.
* Update join->table_count and all-join-tables bitmap.
* That's it. Nothing else?
4. User interface
-----------------
* We'll add an @@optimizer switch flag for table elimination. Tentative
name: 'table_elimination'.
(Note ^^ utility of the above questioned ^, as table elimination can never
be worse than no elimination. We're leaning towards not adding the flag)
* EXPLAIN will not show the removed tables at all. This will allow to check
if tables were removed, and also will behave nicely with anchor model and
VIEWs: stuff that user doesn't care about just won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
then compare run times and make a conclusion about whether dbms supports table
elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require to
rework operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables of which some are 1-row
MyISAM tables that do not have primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For multi-table UPDATEs/DELETEs, we also need to analyze the SET clause:
- affected tables must not be eliminated;
- tables used on the right side of SET x=y assignments must not be
eliminated either.
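An illustrative sketch (table names are made up): in the statement below, t1
is updated and t2 is read in the SET clause, so neither may be eliminated;
t3 remains a candidate:

```sql
-- Assuming t2.id and t3.id are unique keys.
UPDATE t1
  LEFT JOIN t2 ON t2.id = t1.id   -- read in SET: must stay
  LEFT JOIN t3 ON t3.id = t1.id   -- neither updated nor read: may be eliminated
SET t1.val = t2.val;
```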
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed: an aggregate function now reports that it depends on the
tables its arguments depend on. In particular, COUNT(*) reports that it
depends on no tables (item_count_star->used_tables()==0).
One consequence is that "item->used_tables()==0" is no longer equivalent
to "item->const_item()==true" (not sure whether this is new behavior or
has already been the case).
* The EXPLAIN EXTENDED warning text was generated after the JOIN object had
been discarded, which made it impossible to use information about the join
plan when printing the warning. Fixed by keeping the JOIN objects until
we've printed the warning (we also intend to remove the const tables from
the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they should probably be gone from EXPLAIN output as well?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let the EXPLAIN code check whether a
SELECT was eliminated before printing it (EXPLAIN output is generated by
recursive descent, so the check also prevents children of eliminated
selects from being printed).
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between a constant table and an eliminated one? If there is,
should we mark const tables as eliminated too?
From the user/EXPLAIN point of view: no. A constant table is one that we
read one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join:

    create unique index idx on tableB (id, fromDate);
    ...
    left outer join
      tableB B
    on
      B.id = A.id
      and
      B.fromDate = (select max(sub.fromDate)
                    from tableB sub where sub.id = A.id);
This is because the condition "B.fromDate = func(tableB)" cannot be used.
Reason #1: update_ref_and_keys() does not consider such conditions to be of
any use (and indeed they are not usable for ref access), so they are not put
into the KEYUSE array.
Reason #2: even if they were put there, we would need to be able to
distinguish between predicates like

    B.fromDate = func(B.id)  // guarantees only one matching row, as
                             // B.id is already bound by B.id = A.id,
                             // hence B.fromDate becomes bound too
and
    B.fromDate = func(B.*)   // can potentially have many matching
                             // records
We need to
- have update_ref_and_keys() create KEYUSE elements for such equalities;
- have eliminate_tables() and friends make a more accurate check.
The right check is whether all parts of a unique key are bound: if keypartX
is bound, then t.keypartY=func(keypartX) makes keypartY bound as well.
The difficulty here is that a correlated subquery predicate cannot tell
which columns it depends on (it only remembers tables), and traversing the
predicate is expensive and complicated.
We're leaning towards giving each subquery predicate a List<Item> of items
that
- are in the current select, and
- the predicate depends on.
This list will be useful in certain other subquery optimizations as well;
it is cheap to collect in the fix_fields() phase, so it will be collected
for every subquery predicate.
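The keypart-binding rule applied to the earlier example, as a compact sketch:
with the unique key (id, fromDate) on tableB, binding id via B.id = A.id and
then fromDate via a function of already-bound columns binds the whole unique
key, so at most one row of B can match each row of A:

```sql
-- unique key on B: (id, fromDate)
-- B.id is bound by B.id = A.id; the correlated subquery computes
-- fromDate from the already-bound A.id, so fromDate is bound too,
-- and the whole unique key is bound: at most one matching B row.
SELECT A.id
FROM A LEFT JOIN tableB B
  ON B.id = A.id
 AND B.fromDate = (SELECT MAX(sub.fromDate)
                   FROM tableB sub WHERE sub.id = A.id);
```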
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Hi,
I am in the process of setting up Eventum as the bug tracker at Monty Program AB to track
bugs in MariaDB. I would like input on the values we need for the following fields:
* Status
* Category
* Operating Systems
Also please suggest any additional fields you think should be in this system. The current
fields I have are:
* Summary
* Severity
* Category
* Status
* Operating System
* Description
* How to Repeat
* Suggested Fix
* Product
* Product Version
* BZR Tree (URL to relevant tree with bug fix or that illustrates the problem)
Some of these fields (and my current values for them) are inspired by
bugs.mysql.com. However, since we are setting up a brand new system, we
should try to make it fit our needs instead of copying what others are doing.
Any suggestions or comments are appreciated. Once I finish the initial
setup, I will put a test site up for everyone to look over.
Best Regards,
--
Bryan Alsdorf, Lead Web Developer
Monty Program, AB. http://askmonty.org

[Maria-developers] Updated (by Psergey): Add support for google protocol buffers (34)
by worklog-noreply@askmonty.org 21 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 21 Jul 2009, 21:13)=-=-
Low Level Design modified.
--- /tmp/wklog.34.old.6462 2009-07-21 21:13:13.000000000 +0300
+++ /tmp/wklog.34.new.6462 2009-07-21 21:13:13.000000000 +0300
@@ -1 +1,4 @@
+* GPB tarball contains a protocol definition for .proto file structure itself
+ and a parser for text form of .proto file which then exposes the parsed
+ file via standard GPB message navigation API.
-=-=(Psergey - Tue, 21 Jul 2009, 21:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.6399 2009-07-21 21:12:23.000000000 +0300
+++ /tmp/wklog.34.new.6399 2009-07-21 21:12:23.000000000 +0300
@@ -1 +1,78 @@
+<contents>
+1. GPB Encoding overview
+2. GPB in an SQL database
+2.1 Informing server about GPB field names and types
+2.2 Addressing GPB fields
+2.2.1 Option1: SQL Function
+2.2.2 Option2: SQL columns
+</contents>
+
+
+1. GPB Encoding overview
+========================
+
+GBB is a compact encoding for structured and typed data. A unit of GPB data
+(it is called message) is only partially self-describing: it's possible to
+iterate over its parts, but, quoting the spec
+
+http://code.google.com/apis/protocolbuffers/docs/encoding.html:
+ " the name and declared type for each field can only be determined on the
+ decoding end by referencing the message type's definition (i.e. the .proto
+ file). "
+
+2. GPB in an SQL database
+=========================
+
+It is possible to store GPB data in MariaDB today - one can declare a binary
+blob column and use it to store GPB messages. Storing and retrieving entire
+messages will be the only available operations, though, as the server has no
+idea about the GPB format.
+It is apparent that ability to peek inside GPB data from SQL layer would be of
+great advantage: one would be able to
+- select only certain fields or parts of GPB messages
+- filter records based on the values of GPB fields
+- etc
+performing such operations at SQL layer will allow to reduce client<->server
+traffic right away, and will open path to getting the best possible
+performance.
+
+2.1 Informing server about GPB field names and types
+----------------------------------------------------
+User-friendly/meaningful access to GPB fields requires knowledge of GPB field
+names and types, which are not available from GPB message itself (see "GPB
+encoding overview" section).
+
+So the first issue to be addressed is to get the server to know the definition
+of stored messages. We intend to assume that all records have GPB messages
+that conform to a certain single definition, which gives one definition per
+GPB field.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify either the
+definition itself or path to .proto file with the definition.
+
+2.2 Addressing GPB fields
+-------------------------
+We'll need to provide a way to access GPB fields. This can be complicated as
+structures that are encoded in GPB message can be nested and recursive.
+
+2.2.1 Option1: SQL Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Introduce an SQL function GPB_FIELD(path) which will return contents of the
+field.
+- Return type of the function will be determined from GPB message definition.
+- For path, we can use XPath selector (a subset of XPath) syntax.
+
+(TODO ^ the above needs to be specified in more detail. is the selector as
+simple as filesystem path or we allow quantifiers (with predicates?)?)
+
+2.2.2 Option2: SQL columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make GPB columns to be accessible as SQL columns.
+This approach has problems:
+- It might be hard to implement code-wise
+ - (TODO will Virtual columns patch help??)
+- It is not clear how to access fields from nested structures. Should we allow
+ quoted names like `foo/bar[2]/baz' ?
+
DESCRIPTION:
Add support for Google Protocol Buffers (hereafter GPB). It should be
possible to have columns that store GPB-encoded data, as well as SQL
constructs to extract parts of GPB data for use in the select list, for
filtering, and so forth.
Any support for indexing GPB data is outside the scope of this WL entry.
HIGH-LEVEL SPECIFICATION:
<contents>
1. GPB Encoding overview
2. GPB in an SQL database
2.1 Informing server about GPB field names and types
2.2 Addressing GPB fields
2.2.1 Option1: SQL Function
2.2.2 Option2: SQL columns
</contents>
1. GPB Encoding overview
========================
GPB is a compact encoding for structured and typed data. A unit of GPB data
(called a message) is only partially self-describing: it is possible to
iterate over its parts, but, quoting the spec
http://code.google.com/apis/protocolbuffers/docs/encoding.html:
" the name and declared type for each field can only be determined on the
decoding end by referencing the message type's definition (i.e. the .proto
file). "
2. GPB in an SQL database
=========================
It is possible to store GPB data in MariaDB today: one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages will be the only available operations, though, as the server has no
idea about the GPB format.
It is apparent that the ability to peek inside GPB data from the SQL layer
would be of great advantage: one would be able to
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc.
Performing such operations at the SQL layer will reduce client<->server
traffic right away, and will open the path to the best possible performance.
2.1 Informing server about GPB field names and types
----------------------------------------------------
User-friendly/meaningful access to GPB fields requires knowledge of GPB
field names and types, which are not available from the GPB message itself
(see the "GPB Encoding overview" section).
So the first issue to be addressed is to get the server to know the definition
of stored messages. We intend to assume that all records have GPB messages
that conform to a certain single definition, which gives one definition per
GPB field.
DecisionToMake: how do we pass the GPB definition to the server?
First idea: add a CREATE TABLE parameter which will specify either the
definition itself or a path to a .proto file with the definition.
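A hypothetical sketch of the first idea (the PROTO_DEFINITION option does not
exist; the name and placement are invented here purely to make the proposal
concrete):

```sql
-- Hypothetical syntax: attach a .proto definition to a blob column.
CREATE TABLE events (
  id INT PRIMARY KEY,
  payload BLOB PROTO_DEFINITION = '/path/to/event.proto'
);
```

Whether the option belongs at column level (as above) or at table level is
exactly the kind of detail this DecisionToMake would settle.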
2.2 Addressing GPB fields
-------------------------
We'll need to provide a way to access GPB fields. This can be complicated,
as the structures encoded in a GPB message can be nested and recursive.
2.2.1 Option1: SQL Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introduce an SQL function GPB_FIELD(path) which will return the contents of
the field.
- The return type of the function will be determined from the GPB message
definition.
- For the path, we can use XPath selector (a subset of XPath) syntax.
(TODO: the above needs to be specified in more detail. Is the selector as
simple as a filesystem path, or do we allow quantifiers (with predicates)?)
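A hypothetical usage sketch (GPB_FIELD is the proposed, unimplemented
function; it is assumed here to also take the column as a first argument,
which the proposal leaves open, and the path syntax is illustrative only):

```sql
-- Select a nested field and filter on a top-level field of the
-- GPB message stored in the 'payload' blob column.
SELECT GPB_FIELD(payload, '/user/name')
FROM events
WHERE GPB_FIELD(payload, '/status') = 1;
```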
2.2.2 Option2: SQL columns
~~~~~~~~~~~~~~~~~~~~~~~~~~
Make GPB fields accessible as SQL columns.
This approach has problems:
- It might be hard to implement code-wise.
- (TODO: will the virtual columns patch help?)
- It is not clear how to access fields from nested structures. Should we
allow quoted names like `foo/bar[2]/baz`?
LOW-LEVEL DESIGN:
* GPB tarball contains a protocol definition for .proto file structure itself
and a parser for text form of .proto file which then exposes the parsed
file via standard GPB message navigation API.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

[Maria-developers] Updated (by Psergey): Add support for google protocol buffers (34)
by worklog-noreply@askmonty.org 21 Jul '09
by worklog-noreply@askmonty.org 21 Jul '09
21 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 21 Jul 2009, 21:13)=-=-
Low Level Design modified.
--- /tmp/wklog.34.old.6462 2009-07-21 21:13:13.000000000 +0300
+++ /tmp/wklog.34.new.6462 2009-07-21 21:13:13.000000000 +0300
@@ -1 +1,4 @@
+* GPB tarball contains a protocol definition for .proto file structure itself
+ and a parser for text form of .proto file which then exposes the parsed
+ file via standard GPB message navigation API.
-=-=(Psergey - Tue, 21 Jul 2009, 21:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.6399 2009-07-21 21:12:23.000000000 +0300
+++ /tmp/wklog.34.new.6399 2009-07-21 21:12:23.000000000 +0300
@@ -1 +1,78 @@
+<contents>
+1. GPB Encoding overview
+2. GPB in an SQL database
+2.1 Informing server about GPB field names and types
+2.2 Addressing GPB fields
+2.2.1 Option1: SQL Function
+2.2.2 Option2: SQL columns
+</contents>
+
+
+1. GPB Encoding overview
+========================
+
+GBB is a compact encoding for structured and typed data. A unit of GPB data
+(it is called message) is only partially self-describing: it's possible to
+iterate over its parts, but, quoting the spec
+
+http://code.google.com/apis/protocolbuffers/docs/encoding.html:
+ " the name and declared type for each field can only be determined on the
+ decoding end by referencing the message type's definition (i.e. the .proto
+ file). "
+
+2. GPB in an SQL database
+=========================
+
+It is possible to store GPB data in MariaDB today - one can declare a binary
+blob column and use it to store GPB messages. Storing and retrieving entire
+messages will be the only available operations, though, as the server has no
+idea about the GPB format.
+It is apparent that ability to peek inside GPB data from SQL layer would be of
+great advantage: one would be able to
+- select only certain fields or parts of GPB messages
+- filter records based on the values of GPB fields
+- etc
+performing such operations at SQL layer will allow to reduce client<->server
+traffic right away, and will open path to getting the best possible
+performance.
+
+2.1 Informing server about GPB field names and types
+----------------------------------------------------
+User-friendly/meaningful access to GPB fields requires knowledge of GPB field
+names and types, which are not available from GPB message itself (see "GPB
+encoding overview" section).
+
+So the first issue to be addressed is to get the server to know the definition
+of stored messages. We intend to assume that all records have GPB messages
+that conform to a certain single definition, which gives one definition per
+GPB field.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify either the
+definition itself or path to .proto file with the definition.
+
+2.2 Addressing GPB fields
+-------------------------
+We'll need to provide a way to access GPB fields. This can be complicated as
+structures that are encoded in GPB message can be nested and recursive.
+
+2.2.1 Option1: SQL Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Introduce an SQL function GPB_FIELD(path) which will return contents of the
+field.
+- Return type of the function will be determined from GPB message definition.
+- For path, we can use XPath selector (a subset of XPath) syntax.
+
+(TODO ^ the above needs to be specified in more detail. is the selector as
+simple as filesystem path or we allow quantifiers (with predicates?)?)
+
+2.2.2 Option2: SQL columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make GPB columns to be accessible as SQL columns.
+This approach has problems:
+- It might be hard to implement code-wise
+ - (TODO will Virtual columns patch help??)
+- It is not clear how to access fields from nested structures. Should we allow
+ quoted names like `foo/bar[2]/baz' ?
+
DESCRIPTION:
Add support for Google Protocol Buffers (further GPB). It should be possible
to have columns that store GPB-encoded data, as well as use SQL constructs to
extract parts of GPB data for use in select list, for filtering, and so forth.
Any support for indexing GPB data is outside of scope of this WL entry.
HIGH-LEVEL SPECIFICATION:
<contents>
1. GPB Encoding overview
2. GPB in an SQL database
2.1 Informing server about GPB field names and types
2.2 Addressing GPB fields
2.2.1 Option1: SQL Function
2.2.2 Option2: SQL columns
</contents>
1. GPB Encoding overview
========================
GBB is a compact encoding for structured and typed data. A unit of GPB data
(it is called message) is only partially self-describing: it's possible to
iterate over its parts, but, quoting the spec
http://code.google.com/apis/protocolbuffers/docs/encoding.html:
" the name and declared type for each field can only be determined on the
decoding end by referencing the message type's definition (i.e. the .proto
file). "
2. GPB in an SQL database
=========================
It is possible to store GPB data in MariaDB today - one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages will be the only available operations, though, as the server has no
idea about the GPB format.
It is apparent that ability to peek inside GPB data from SQL layer would be of
great advantage: one would be able to
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc
performing such operations at SQL layer will allow to reduce client<->server
traffic right away, and will open path to getting the best possible
performance.
2.1 Informing server about GPB field names and types
----------------------------------------------------
User-friendly/meaningful access to GPB fields requires knowledge of GPB field
names and types, which are not available from GPB message itself (see "GPB
encoding overview" section).
So the first issue to be addressed is to get the server to know the definition
of stored messages. We intend to assume that all records have GPB messages
that conform to a certain single definition, which gives one definition per
GPB field.
DecisionToMake: How to pass the server the GPB definition?
First idea: add a CREATE TABLE parameter which will specify either the
definition itself or path to .proto file with the definition.
2.2 Addressing GPB fields
-------------------------
We'll need to provide a way to access GPB fields. This can be complicated as
structures that are encoded in GPB message can be nested and recursive.
2.2.1 Option1: SQL Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introduce an SQL function GPB_FIELD(path) which will return contents of the
field.
- Return type of the function will be determined from GPB message definition.
- For path, we can use XPath selector (a subset of XPath) syntax.
(TODO ^ the above needs to be specified in more detail. is the selector as
simple as filesystem path or we allow quantifiers (with predicates?)?)
2.2.2 Option2: SQL columns
~~~~~~~~~~~~~~~~~~~~~~~~~~
Make GPB columns to be accessible as SQL columns.
This approach has problems:
- It might be hard to implement code-wise
- (TODO will Virtual columns patch help??)
- It is not clear how to access fields from nested structures. Should we allow
quoted names like `foo/bar[2]/baz' ?
LOW-LEVEL DESIGN:
* GPB tarball contains a protocol definition for .proto file structure itself
and a parser for text form of .proto file which then exposes the parsed
file via standard GPB message navigation API.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Updated (by Psergey): Add support for google protocol buffers (34)
by worklog-noreply@askmonty.org 21 Jul '09
by worklog-noreply@askmonty.org 21 Jul '09
21 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 21 Jul 2009, 21:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.6399 2009-07-21 21:12:23.000000000 +0300
+++ /tmp/wklog.34.new.6399 2009-07-21 21:12:23.000000000 +0300
@@ -1 +1,78 @@
+<contents>
+1. GPB Encoding overview
+2. GPB in an SQL database
+2.1 Informing server about GPB field names and types
+2.2 Addressing GPB fields
+2.2.1 Option1: SQL Function
+2.2.2 Option2: SQL columns
+</contents>
+
+
+1. GPB Encoding overview
+========================
+
+GBB is a compact encoding for structured and typed data. A unit of GPB data
+(it is called message) is only partially self-describing: it's possible to
+iterate over its parts, but, quoting the spec
+
+http://code.google.com/apis/protocolbuffers/docs/encoding.html:
+ " the name and declared type for each field can only be determined on the
+ decoding end by referencing the message type's definition (i.e. the .proto
+ file). "
+
+2. GPB in an SQL database
+=========================
+
+It is possible to store GPB data in MariaDB today - one can declare a binary
+blob column and use it to store GPB messages. Storing and retrieving entire
+messages will be the only available operations, though, as the server has no
+idea about the GPB format.
+It is apparent that ability to peek inside GPB data from SQL layer would be of
+great advantage: one would be able to
+- select only certain fields or parts of GPB messages
+- filter records based on the values of GPB fields
+- etc
+performing such operations at SQL layer will allow to reduce client<->server
+traffic right away, and will open path to getting the best possible
+performance.
+
+2.1 Informing server about GPB field names and types
+----------------------------------------------------
+User-friendly/meaningful access to GPB fields requires knowledge of GPB field
+names and types, which are not available from GPB message itself (see "GPB
+encoding overview" section).
+
+So the first issue to be addressed is to get the server to know the definition
+of stored messages. We intend to assume that all records have GPB messages
+that conform to a certain single definition, which gives one definition per
+GPB field.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify either the
+definition itself or path to .proto file with the definition.
+
+2.2 Addressing GPB fields
+-------------------------
+We'll need to provide a way to access GPB fields. This can be complicated as
+structures that are encoded in GPB message can be nested and recursive.
+
+2.2.1 Option1: SQL Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Introduce an SQL function GPB_FIELD(path) which will return contents of the
+field.
+- Return type of the function will be determined from GPB message definition.
+- For path, we can use XPath selector (a subset of XPath) syntax.
+
+(TODO ^ the above needs to be specified in more detail. is the selector as
+simple as filesystem path or we allow quantifiers (with predicates?)?)
+
+2.2.2 Option2: SQL columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make GPB columns to be accessible as SQL columns.
+This approach has problems:
+- It might be hard to implement code-wise
+ - (TODO will Virtual columns patch help??)
+- It is not clear how to access fields from nested structures. Should we allow
+ quoted names like `foo/bar[2]/baz' ?
+
DESCRIPTION:
Add support for Google Protocol Buffers (further GPB). It should be possible
to have columns that store GPB-encoded data, as well as use SQL constructs to
extract parts of GPB data for use in select list, for filtering, and so forth.
Any support for indexing GPB data is outside of scope of this WL entry.
HIGH-LEVEL SPECIFICATION:
<contents>
1. GPB Encoding overview
2. GPB in an SQL database
2.1 Informing server about GPB field names and types
2.2 Addressing GPB fields
2.2.1 Option1: SQL Function
2.2.2 Option2: SQL columns
</contents>
1. GPB Encoding overview
========================
GBB is a compact encoding for structured and typed data. A unit of GPB data
(it is called message) is only partially self-describing: it's possible to
iterate over its parts, but, quoting the spec
http://code.google.com/apis/protocolbuffers/docs/encoding.html:
" the name and declared type for each field can only be determined on the
decoding end by referencing the message type's definition (i.e. the .proto
file). "
2. GPB in an SQL database
=========================
It is possible to store GPB data in MariaDB today - one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages will be the only available operations, though, as the server has no
idea about the GPB format.
It is apparent that ability to peek inside GPB data from SQL layer would be of
great advantage: one would be able to
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc
performing such operations at SQL layer will allow to reduce client<->server
traffic right away, and will open path to getting the best possible
performance.
2.1 Informing server about GPB field names and types
----------------------------------------------------
User-friendly/meaningful access to GPB fields requires knowledge of GPB field
names and types, which are not available from GPB message itself (see "GPB
encoding overview" section).
So the first issue to be addressed is to get the server to know the definition
of stored messages. We intend to assume that all records have GPB messages
that conform to a certain single definition, which gives one definition per
GPB field.
DecisionToMake: How to pass the server the GPB definition?
First idea: add a CREATE TABLE parameter which will specify either the
definition itself or path to .proto file with the definition.
2.2 Addressing GPB fields
-------------------------
We'll need to provide a way to access GPB fields. This can be complicated as
structures that are encoded in GPB message can be nested and recursive.
2.2.1 Option1: SQL Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introduce an SQL function GPB_FIELD(path) which will return contents of the
field.
- Return type of the function will be determined from GPB message definition.
- For path, we can use XPath selector (a subset of XPath) syntax.
(TODO ^ the above needs to be specified in more detail. is the selector as
simple as filesystem path or we allow quantifiers (with predicates?)?)
2.2.2 Option2: SQL columns
~~~~~~~~~~~~~~~~~~~~~~~~~~
Make GPB columns to be accessible as SQL columns.
This approach has problems:
- It might be hard to implement code-wise
- (TODO will Virtual columns patch help??)
- It is not clear how to access fields from nested structures. Should we allow
quoted names like `foo/bar[2]/baz' ?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0

[Maria-developers] Updated (by Psergey): Add support for google protocol buffers (34)
by worklog-noreply@askmonty.org 21 Jul '09
by worklog-noreply@askmonty.org 21 Jul '09
21 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 21 Jul 2009, 21:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.6399 2009-07-21 21:12:23.000000000 +0300
+++ /tmp/wklog.34.new.6399 2009-07-21 21:12:23.000000000 +0300
@@ -1 +1,78 @@
+<contents>
+1. GPB Encoding overview
+2. GPB in an SQL database
+2.1 Informing server about GPB field names and types
+2.2 Addressing GPB fields
+2.2.1 Option1: SQL Function
+2.2.2 Option2: SQL columns
+</contents>
+
+
+1. GPB Encoding overview
+========================
+
+GBB is a compact encoding for structured and typed data. A unit of GPB data
+(it is called message) is only partially self-describing: it's possible to
+iterate over its parts, but, quoting the spec
+
+http://code.google.com/apis/protocolbuffers/docs/encoding.html:
+ " the name and declared type for each field can only be determined on the
+ decoding end by referencing the message type's definition (i.e. the .proto
+ file). "
+
+2. GPB in an SQL database
+=========================
+
+It is possible to store GPB data in MariaDB today - one can declare a binary
+blob column and use it to store GPB messages. Storing and retrieving entire
+messages will be the only available operations, though, as the server has no
+idea about the GPB format.
+It is apparent that ability to peek inside GPB data from SQL layer would be of
+great advantage: one would be able to
+- select only certain fields or parts of GPB messages
+- filter records based on the values of GPB fields
+- etc
+performing such operations at SQL layer will allow to reduce client<->server
+traffic right away, and will open path to getting the best possible
+performance.
+
+2.1 Informing server about GPB field names and types
+----------------------------------------------------
+User-friendly/meaningful access to GPB fields requires knowledge of GPB field
+names and types, which are not available from GPB message itself (see "GPB
+encoding overview" section).
+
+So the first issue to be addressed is to get the server to know the definition
+of stored messages. We intend to assume that all records have GPB messages
+that conform to a certain single definition, which gives one definition per
+GPB field.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify either the
+definition itself or path to .proto file with the definition.
+
+2.2 Addressing GPB fields
+-------------------------
+We'll need to provide a way to access GPB fields. This can be complicated as
+structures that are encoded in GPB message can be nested and recursive.
+
+2.2.1 Option1: SQL Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Introduce an SQL function GPB_FIELD(path) which will return contents of the
+field.
+- Return type of the function will be determined from GPB message definition.
+- For path, we can use XPath selector (a subset of XPath) syntax.
+
+(TODO ^ the above needs to be specified in more detail. is the selector as
+simple as filesystem path or we allow quantifiers (with predicates?)?)
+
+2.2.2 Option2: SQL columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make GPB columns to be accessible as SQL columns.
+This approach has problems:
+- It might be hard to implement code-wise
+ - (TODO will Virtual columns patch help??)
+- It is not clear how to access fields from nested structures. Should we allow
+ quoted names like `foo/bar[2]/baz' ?
+
DESCRIPTION:
Add support for Google Protocol Buffers (hereafter GPB). It should be possible
to have columns that store GPB-encoded data, as well as SQL constructs to
extract parts of GPB data for use in the select list, for filtering, and so
forth. Any support for indexing GPB data is outside the scope of this WL entry.
HIGH-LEVEL SPECIFICATION:
<contents>
1. GPB Encoding overview
2. GPB in an SQL database
2.1 Informing server about GPB field names and types
2.2 Addressing GPB fields
2.2.1 Option1: SQL Function
2.2.2 Option2: SQL columns
</contents>
1. GPB Encoding overview
========================
GPB is a compact encoding for structured and typed data. A unit of GPB data
(called a message) is only partially self-describing: it is possible to
iterate over its parts, but, quoting the spec
http://code.google.com/apis/protocolbuffers/docs/encoding.html:
  " the name and declared type for each field can only be determined on the
    decoding end by referencing the message type's definition (i.e. the .proto
    file). "
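The quoted limitation is easy to see in code. Below is a small, purely
illustrative Python sketch of a wire-format walker (not MariaDB code): it
recovers field numbers, wire types, and raw values from a message, but the
field names and declared types would have to come from the .proto definition.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def iter_fields(buf):
    """Yield (field_number, wire_type, raw_value) triples.

    This is all a decoder can recover without the .proto definition:
    field numbers and wire types, but not names or declared types.
    """
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 7
        if wire_type == 0:                # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:              # 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:              # length-delimited (string/bytes/submessage)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:              # 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        yield field_no, wire_type, value

# Example message: field 1 = varint 150, field 2 = bytes "abc"
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x03]) + b"abc"
print(list(iter_fields(msg)))   # → [(1, 0, 150), (2, 2, b'abc')]
```

Note that nothing tells us whether field 2 is a string, raw bytes, or a nested
message; that distinction lives only in the .proto file.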
2. GPB in an SQL database
=========================
It is possible to store GPB data in MariaDB today - one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages are the only available operations, though, as the server knows
nothing about the GPB format.
The ability to peek inside GPB data from the SQL layer would clearly be a
great advantage: one would be able to
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc.
Performing such operations at the SQL layer would reduce client<->server
traffic right away, and would open the path to the best possible performance.
2.1 Informing server about GPB field names and types
----------------------------------------------------
User-friendly, meaningful access to GPB fields requires knowledge of GPB field
names and types, which are not available from the GPB message itself (see the
"GPB Encoding overview" section).
So the first issue to be addressed is getting the server to know the
definition of the stored messages. We intend to assume that all records hold
GPB messages that conform to a certain single definition, which gives one
definition per GPB column.
DecisionToMake: how do we pass the GPB definition to the server?
First idea: add a CREATE TABLE parameter which will specify either the
definition itself or the path to a .proto file with the definition.
2.2 Addressing GPB fields
-------------------------
We'll need to provide a way to access GPB fields. This can be complicated, as
the structures encoded in a GPB message can be nested and recursive.
2.2.1 Option1: SQL Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introduce an SQL function GPB_FIELD(path) which will return the contents of
the field.
- The return type of the function will be determined from the GPB message
  definition.
- For the path, we can use XPath selector (a subset of XPath) syntax.
(TODO: the above needs to be specified in more detail. Is the selector as
simple as a filesystem path, or do we allow quantifiers (with predicates?)?)
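As a thought experiment, here is a minimal Python sketch of the intended
GPB_FIELD semantics, assuming the simplest variant of the selector (a
filesystem-like path, no quantifiers). The schema layout, the path syntax,
and the gpb_field name are hypothetical illustrations of this proposal, not
existing server code; the point is that the value comes from the message
while the declared type comes from the definition.

```python
# Hypothetical message definition: field name -> declared type,
# with nested dicts standing in for nested message types.
SCHEMA = {
    "point": {"x": "int32", "y": "int32"},
    "label": "string",
}

def gpb_field(decoded_msg, schema, path):
    """Resolve a /-separated field path; return (value, declared_type)."""
    node, types = decoded_msg, schema
    for step in path.split("/"):
        types = types[step]        # KeyError -> unknown field name
        node = node[step]
    declared = types if isinstance(types, str) else "message"
    return node, declared

msg = {"point": {"x": 10, "y": 20}, "label": "hello"}
print(gpb_field(msg, SCHEMA, "point/x"))   # → (10, 'int32')
print(gpb_field(msg, SCHEMA, "label"))     # → ('hello', 'string')
```

The declared type ('int32', 'string', ...) is what the server would map to
the SQL return type of the function.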
2.2.2 Option2: SQL columns
~~~~~~~~~~~~~~~~~~~~~~~~~~
Make GPB fields accessible as SQL columns.
This approach has problems:
- It might be hard to implement code-wise
  - (TODO: will the Virtual columns patch help?)
- It is not clear how to access fields from nested structures. Should we allow
  quoted names like `foo/bar[2]/baz` ?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] New (by Psergey): Add support for google protocol buffers (34)
by worklog-noreply@askmonty.org 21 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Add support for Google Protocol Buffers (hereafter GPB). It should be possible
to have columns that store GPB-encoded data, as well as SQL constructs to
extract parts of GPB data for use in the select list, for filtering, and so
forth. Any support for indexing GPB data is outside the scope of this WL entry.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
[Maria-developers] Progress (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 20 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 1
ESTIMATE.......: 3 (hours remain)
ORIG. ESTIMATE.: 3
PROGRESS NOTES:
-=-=(Guest - Mon, 20 Jul 2009, 14:28)=-=-
deukje weg
Worked 1 hour and estimate 3 hours remain (original estimate increased by 4 hours).
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Guest - Wed, 10 Jun 2009, 01:23)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.1842 2009-06-10 01:23:42.000000000 +0300
+++ /tmp/wklog.17.new.1842 2009-06-10 01:23:42.000000000 +0300
@@ -131,6 +131,11 @@
- this is what happens for constant tables, too.
- I don't see how showing them could be of any use. They only make it
harder to read the rewritten query.
+ It turns out that
+ - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
+ lifetime) changes.
+ - it is hard to have it show per-execution data. This is because the warning
+ text is generated after the execution structures have been destroyed.
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Wed, 03 Jun 2009, 22:01)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.21801 2009-06-03 22:01:34.000000000 +0300
+++ /tmp/wklog.17.new.21801 2009-06-03 22:01:34.000000000 +0300
@@ -1,3 +1,6 @@
+The code (currently in development) is at lp:
+~maria-captains/maria/maria-5.1-table-elimination tree.
+
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
------------------------------------------------------------
-=-=(View All Progress Notes, 23 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE tableB (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value; the unmatched B.* row would be all NULLs.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example, as is often the
case, a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If, further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
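The cardinality argument above can be sketched in a few lines of Python
(purely illustrative, with tables modeled as lists of dicts): with a unique
join key, the left outer join returns exactly len(A) rows, so if no B columns
are selected the join is removable; with duplicated key values it is not.

```python
from collections import defaultdict

def left_join_row_count(left_rows, right_rows, key):
    """Count the rows a left outer join on `key` would produce."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    count = 0
    for l in left_rows:
        matches = index.get(l[key], [None])  # [None] = NULL-extended row
        count += len(matches)
    return count

A = [{"id": 1, "colA": "val1"}, {"id": 2, "colA": "val2"}]
B_unique = [{"id": 1}]                 # id values are unique
B_dup    = [{"id": 1}, {"id": 1}]      # duplicate id values

print(left_join_row_count(A, B_unique, "id"))  # → 2 == len(A): join removable
print(left_join_row_count(A, B_dup, "id"))     # → 3 >  len(A): must touch B
```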
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations. Either when you have
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest either
  - is a constant table:
    = because it can be accessed via eq_ref(const) access, or
    = because it is a zero-row or one-row MyISAM-like table [MARK1]
  - or has an eq_ref access method candidate.
#2 means that the WHERE clause, the ON clauses of embedding outer joins,
ORDER BY, GROUP BY and HAVING do not refer to the inner tables of the
outer join nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:

  if ((tables used in select_list |
       tables used in group/order by UNION |
       tables used in where) != bitmap_of_all_tables)
  {
    attempt table elimination;
  }
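The quick check above can be sketched directly with integer bitmaps (the same
representation the server's table_map uses); the function name and arguments
are illustrative, not actual server code:

```python
def may_have_eliminable_tables(select_list_map, group_order_map,
                               where_map, all_tables_map):
    """True if some table is referenced nowhere outside the joins,
    i.e. table elimination is worth attempting."""
    used = select_list_map | group_order_map | where_map
    return used != all_tables_map

# Tables A = bit 0, B = bit 1. Only A appears outside the join:
A, B = 1 << 0, 1 << 1
print(may_have_eliminable_tables(A, 0, A, A | B))   # → True: B is a candidate
print(may_have_eliminable_tables(A | B, 0, A, A | B))  # → False: B is referenced
```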
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice of removing either this or
  that).
* It is always better to remove as many tables as possible (at least within
  our cost model).
Thus, there is no need for any cost calculations: it's an unconditional
rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front, like is done with const
  tables, with the exception that if the eliminated outer join nest was
  within another outer join nest, that shouldn't prevent us from moving the
  eliminated tables away.
* Update join->table_count and the all-join-tables bitmap.
* That's it. Nothing else?
4. User interface
-----------------
* We'll add an @@optimizer_switch flag for table elimination. Tentative
  name: 'table_elimination'.
  (Note: the utility of the above is questioned, as table elimination can
  never be worse than no elimination. We're leaning towards not adding the
  flag.)
* EXPLAIN will not show the removed tables at all. This allows one to check
  whether tables were removed, and also behaves nicely with the anchor model
  and VIEWs: stuff that the user doesn't care about just won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to the ones above (so that they would have
  the same QEP, execution cost, etc.) but cannot use table elimination,
then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to want to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1], as that can change during
the lifetime of the statement.
The other option is to do table elimination every time. This would require
reworking operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables, of which some are
one-row MyISAM tables without a primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
* Aggregate functions used to report that they depend on all tables, that is,
  item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed: an aggregate function now reports that it depends on the
tables that its arguments depend on. In particular, COUNT(*) reports
that it depends on no tables (item_count_star->used_tables()==0).
One consequence is that "item->used_tables()==0" is no longer
equivalent to "item->const_item()==true" (not sure whether it's "no
longer" or this has already been happening).
* The EXPLAIN EXTENDED warning text was generated after the JOIN object had
been discarded. This made it impossible to use information about the join
plan when printing the warning. Fixed by keeping the JOIN objects until
we've printed the warning (we also intend to remove the const tables from
the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let the EXPLAIN code check whether
the SELECT was eliminated before printing (EXPLAIN is generated by
recursive descent, so the check will also cause children of eliminated
selects not to be printed).
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we
read one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because the condition "B.fromDate = func(tableB)" cannot be used:
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access),
so they are not put into the KEYUSE array.
Reason#2: even if they were put there, we would need to be able to tell
apart predicates like
B.fromDate = func(B.id)   // guarantees only one matching row, as
                          // B.id is already bound by B.id = A.id,
                          // hence B.fromDate becomes bound too
and
B.fromDate = func(B.*)    // can potentially have many matching
                          // records
We need to
- have update_ref_and_keys() create KEYUSE elements for such equalities, and
- have eliminate_tables() and friends make a more accurate check.
The right check is whether all parts of a unique key are bound. If keypartX
is bound, then t.keypartY = func(keypartX) makes keypartY bound as well.
The difficulty here is that a correlated subquery predicate cannot tell
which columns it depends on (it only remembers tables). Traversing the
predicate is expensive and complicated. We're leaning towards making each
subquery predicate keep a List<Item> of the items that
- are in the current select, and
- that it depends on.
This list will be useful in certain other subquery optimizations as well;
it is cheap to collect in the fix_fields() phase, so it will be collected
for every subquery predicate.
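The proposed "all parts of a unique key are bound" check can be sketched as a
small fixpoint computation. The following is an illustrative Python model
(column names and the function name are made up for the example, not server
code): starting from the columns bound by outer references, boundness is
propagated through equalities of the form col = func(cols...) until nothing
changes, and then every part of the unique key must be bound.

```python
def all_keyparts_bound(unique_key, initially_bound, equalities):
    """equalities: list of (column, set of columns its func() depends on)."""
    bound = set(initially_bound)
    changed = True
    while changed:
        changed = False
        for col, deps in equalities:
            # col = func(deps): col becomes bound once all deps are bound.
            if col not in bound and deps <= bound:
                bound.add(col)
                changed = True
    return set(unique_key) <= bound

# Unique key (id, fromDate); id is bound by B.id = A.id, and the correlated
# MAX() subquery makes fromDate = func(id):
print(all_keyparts_bound(["id", "fromDate"], ["id"],
                         [("fromDate", {"id"})]))           # → True

# But fromDate = func(B.*) depends on fromDate itself, so it never binds:
print(all_keyparts_bound(["id", "fromDate"], ["id"],
                         [("fromDate", {"id", "fromDate"})]))  # → False
```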
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Guest - Wed, 10 Jun 2009, 01:23)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.1842 2009-06-10 01:23:42.000000000 +0300
+++ /tmp/wklog.17.new.1842 2009-06-10 01:23:42.000000000 +0300
@@ -131,6 +131,11 @@
- this is what happens for constant tables, too.
- I don't see how showing them could be of any use. They only make it
harder to read the rewritten query.
+ It turns out that
+ - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
+ lifetime) changes.
+ - it is hard to have it show per-execution data. This is because the warning
+ text is generated after the execution structures have been destroyed.
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Wed, 03 Jun 2009, 22:01)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.21801 2009-06-03 22:01:34.000000000 +0300
+++ /tmp/wklog.17.new.21801 2009-06-03 22:01:34.000000000 +0300
@@ -1,3 +1,6 @@
+The code (currently in development) is at lp:
+~maria-captains/maria/maria-5.1-table-elimination tree.
+
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
------------------------------------------------------------
-=-=(View All Progress Notes, 23 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does is remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us, for example,
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value, and the unmatched B.* columns will all be NULL.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that allow a unique
constraint to be placed on the column B.id, for example (and often in
practice) a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
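The row-counting argument above can be illustrated with a small sketch (plain Python, not DBMS code; the naive nested-loop join and the tiny tables are purely illustrative):

```python
# Naive nested-loop left outer join, to show why a uniqueness guarantee on
# B.id is what makes the join removable: it keeps the row count unchanged.

def left_join(left_rows, right_rows, key):
    result = []
    for l in left_rows:
        matches = [r for r in right_rows if r[key] == l[key]]
        if matches:
            result.extend((l, r) for r in matches)
        else:
            result.append((l, None))  # unmatched: B.* columns become NULL
    return result

tableA   = [{"id": 1, "colA": "val1"}, {"id": 2, "colA": "val2"}]
unique_B = [{"id": 1, "other": "x"}]                           # B.id unique
dup_B    = [{"id": 1, "other": "a"}, {"id": 1, "other": "b"}]  # duplicates

assert len(left_join(tableA, unique_B, "id")) == len(tableA)  # join removable
assert len(left_join(tableA, dup_B, "id")) > len(tableA)      # not removable
```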
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
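The "at most one matching row" claim can be checked with another small sketch (illustrative Python; ISO date strings are used so that max() compares correctly):

```python
# With unique key (id, fromDate), the ON condition
#   B.fromDate = (select max(sub.fromDate) from tableB sub where sub.id = A.id)
# selects at most one B row per A row: only one row can equal the maximum.
tableB = [(1, "2009-01-01"), (1, "2009-06-01"), (2, "2009-03-01")]

def matching_rows(a_id):
    latest = max((d for (i, d) in tableB if i == a_id), default=None)
    return [(i, d) for (i, d) in tableB if i == a_id and d == latest]

assert len(matching_rows(1)) == 1  # id=1 has duplicates, but one latest row
assert len(matching_rows(3)) == 0  # no match: the row is NULL-complemented
```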
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is in the
lp:~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of an outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest:
- is a constant table:
  = because it can be accessed via eq_ref(const) access, or
  = because it is a zero-rows or one-row MyISAM-like table [MARK1]
- or has an eq_ref access method candidate.
#2 means that the WHERE clause, the ON clauses of embedding outer joins,
ORDER BY, GROUP BY and HAVING do not refer to the inner tables of the
outer join nest.
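As a rough sketch (simplified Python, not the actual KEYUSE/TABLE_LIST machinery; the data structures are invented for illustration), the two conditions amount to:

```python
def can_eliminate(unique_keys, bound_cols, cols_used_elsewhere):
    """unique_keys: unique/primary keys of the inner table, as column tuples.
    bound_cols: inner-table columns bound by equalities with outer tables.
    cols_used_elsewhere: inner-table columns referenced in the select list,
    WHERE, ON clauses of embedding joins, ORDER BY, GROUP BY or HAVING."""
    if cols_used_elsewhere:          # condition #2 fails
        return False
    # condition #1: some unique key fully bound -> eq_ref-like access,
    # at most one inner row per outer record combination
    return any(all(c in bound_cols for c in key) for key in unique_keys)

# tableB(id primary key), ON B.id = A.id, no other references to B.*:
assert can_eliminate([("id",)], {"id"}, set()) is True
# primary key (id, fromDate) with only id bound: not eliminable
assert can_eliminate([("id", "fromDate")], {"id"}, set()) is False
```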
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
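The same quick check, written out as a runnable bitmap sketch (Python, with one bit per table, following the server's table_map convention):

```python
def may_have_candidates(select_map, group_order_map, where_map, all_tables_map):
    # If some table's bit is set in no "visible" clause, it is a candidate
    # for elimination and the more expensive nest enumeration is worth doing.
    used = select_map | group_order_map | where_map
    return used != all_tables_map

assert may_have_candidates(0b01, 0b00, 0b01, 0b11) is True   # table 2 unused
assert may_have_candidates(0b11, 0b00, 0b01, 0b11) is False  # all tables used
```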
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front, like it is done with
const tables, with the exception that if the eliminated outer join nest was
within another outer join nest, that shouldn't prevent us from moving away
the eliminated tables.
* Update join->table_count and all-join-tables bitmap.
* That's it. Nothing else?
4. User interface
-----------------
* We'll add an @@optimizer switch flag for table elimination. Tentative
name: 'table_elimination'.
(Note: the utility of the above is questioned, as table elimination can never
be worse than no elimination. We're leaning towards not adding the flag.)
* EXPLAIN will not show the removed tables at all. This makes it possible to
check whether tables were removed, and also behaves nicely with the anchor
model and VIEWs: stuff that the user doesn't care about just won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
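A sketch of that benchmark logic (Python; `run_query` is a hypothetical callable standing in for however sql-bench executes a statement, and the 0.7 ratio is an arbitrary illustrative threshold, not a value from this worklog):

```python
import time

def median_runtime(run_query, query, repeats=9):
    # Time several executions and take the median, to dampen noise.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]

def supports_table_elimination(run_query, eliminable_q, similar_q, ratio=0.7):
    # The eliminable query should be clearly cheaper than the structurally
    # similar query that cannot use elimination; otherwise assume no support.
    fast = median_runtime(run_query, eliminable_q)
    slow = median_runtime(run_query, similar_q)
    return fast < ratio * slow
```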
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require
reworking operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables of which some are
one-row MyISAM tables that do not have a primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it, now aggregate function reports it depends on
tables that its arguments depend on. In particular, COUNT(*) reports
that it depends on no tables (item_count_star->used_tables()==0).
One consequence of that is that "item->used_tables()==0" is not
equivalent to "item->const_item()==true" anymore (not sure if it's
"anymore" or this has been already happening).
* The EXPLAIN EXTENDED warning text was generated after the JOIN object had
been discarded. This made it impossible to use information about the join
plan when printing the warning. Fixed this by keeping the JOIN objects until
we've printed the warning (we also intend to remove the const tables from
the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let EXPLAIN code check if the
SELECT was eliminated before the printing (EXPLAIN is generated by doing
a recursive descent, so the check will also cause children of eliminated
selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we read
one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because condition "B.fromDate= func(tableB)" cannot be used.
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access)
so they are not put into KEYUSE array.
Reason#2: even if they were put there, we would need to be able to
distinguish between predicates like
B.fromDate= func(B.id) // guarantees only one matching row as
// B.id is already bound by B.id=A.id
// hence B.fromDate becomes bound too.
and
"B.fromDate= func(B.*)" // Can potentially have many matching
// records.
We need to
- Have update_ref_and_keys() create KEYUSE elements for such equalities
- Have eliminate_tables() and friends make a more accurate check.
The right check is whether all parts of a unique key are bound: if
keypartX is bound, then t.keypartY=func(keypartX) makes keypartY bound
as well.
The difficulty here is that a correlated subquery predicate cannot tell
which columns it depends on (it only remembers tables), and traversing
the predicate is expensive and complicated.
We're leaning towards making each subquery predicate keep a List<Item> of
the items that are in the current select and that the predicate depends on.
This list will be useful in certain other subquery optimizations as well,
and since it is cheap to collect in the fix_fields() phase, it will be
collected for every subquery predicate.
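The bound-keypart fixed point described above can be sketched like this (illustrative Python; equalities are modeled as (target_column, source_columns) pairs, a deliberate simplification of Item dependencies):

```python
def bound_columns(seed, equalities):
    """seed: columns bound directly by outer-table equalities.
    equalities: (target, sources) pairs meaning target = func(*sources).
    Iterate to a fixed point: a target becomes bound once all sources are."""
    bound = set(seed)
    changed = True
    while changed:
        changed = False
        for target, sources in equalities:
            if target not in bound and all(s in bound for s in sources):
                bound.add(target)
                changed = True
    return bound

# B.id is bound via B.id = A.id; the correlated subselect gives
# B.fromDate = func(B.id), so the whole unique key (id, fromDate) gets bound.
assert bound_columns({"B.id"}, [("B.fromDate", ["B.id"])]) == {"B.id", "B.fromDate"}
```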
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Hi all,
Just a heads up, please don't merge in this change from MySQL 5.4.4
See http://openquery.com/blog/type-disappears-mysql-544 for
background, also http://bugs.mysql.com/bug.php?id=17501
Sergei Golubchik is still assigned to that bug, btw ;-)
Cheers,
Arjen.
--
Arjen Lentz, Director @ Open Query (http://openquery.com)
Exceptional Services for MySQL at a fixed budget.
Follow our blog at http://openquery.com/blog/
OurDelta: free enhanced builds for MySQL @ http://ourdelta.org
[Maria-developers] Updated (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 16 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Guest - Wed, 10 Jun 2009, 01:23)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.1842 2009-06-10 01:23:42.000000000 +0300
+++ /tmp/wklog.17.new.1842 2009-06-10 01:23:42.000000000 +0300
@@ -131,6 +131,11 @@
- this is what happens for constant tables, too.
- I don't see how showing them could be of any use. They only make it
harder to read the rewritten query.
+ It turns out that
+ - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
+ lifetime) changes.
+ - it is hard to have it show per-execution data. This is because the warning
+ text is generated after the execution structures have been destroyed.
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Wed, 03 Jun 2009, 22:01)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.21801 2009-06-03 22:01:34.000000000 +0300
+++ /tmp/wklog.17.new.21801 2009-06-03 22:01:34.000000000 +0300
@@ -1,3 +1,6 @@
+The code (currently in development) is at lp:
+~maria-captains/maria/maria-5.1-table-elimination tree.
+
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
-=-=(Guest - Wed, 03 Jun 2009, 15:04)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.20378 2009-06-03 15:04:54.000000000 +0300
+++ /tmp/wklog.17.new.20378 2009-06-03 15:04:54.000000000 +0300
@@ -135,3 +135,8 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
------------------------------------------------------------
-=-=(View All Progress Notes, 22 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does is remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us, for example,
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value, and the unmatched B.* columns will all be NULL.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example (and most often
in practice) a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
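The row-count guarantee above can be demonstrated with any SQL engine. Here is a minimal sketch using Python's built-in sqlite3 (sqlite3 does not itself perform table elimination; the point is only that both queries provably return the same result, which is what licenses the rewrite — table contents are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# tableB.id is unique (PRIMARY KEY), so the left join can match
# at most one B row per A row.
cur.execute("CREATE TABLE tableA (id INTEGER, colA TEXT)")
cur.execute("CREATE TABLE tableB (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO tableA VALUES (?, ?)",
                [(1, "x"), (2, "y"), (3, "z")])
cur.execute("INSERT INTO tableB VALUES (1)")

with_join = cur.execute(
    "SELECT A.colA FROM tableA A LEFT OUTER JOIN tableB B ON B.id = A.id"
).fetchall()
without_join = cur.execute("SELECT A.colA FROM tableA A").fetchall()

# The join neither adds nor removes rows and no B column is selected,
# so an optimizer may skip tableB entirely.
assert sorted(with_join) == sorted(without_join)
assert len(with_join) == 3
```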
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
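The at-most-one-match property of the subselect form can be checked the same way; a sketch again using sqlite3, with a composite primary key (id, fromDate) and deliberately duplicated id values (data is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE tableA (id INTEGER, colA TEXT)")
cur.execute("CREATE TABLE tableB (id INTEGER, fromDate TEXT, "
            "PRIMARY KEY (id, fromDate))")
cur.executemany("INSERT INTO tableA VALUES (?, ?)", [(1, "x"), (2, "y")])
cur.executemany("INSERT INTO tableB VALUES (?, ?)",
                [(1, "2009-01-01"), (1, "2009-06-01"), (2, "2009-03-01")])

rows = cur.execute("""
    SELECT A.colA
    FROM tableA A
    LEFT OUTER JOIN tableB B
      ON B.id = A.id
     AND B.fromDate = (SELECT MAX(sub.fromDate)
                       FROM tableB sub WHERE sub.id = A.id)
""").fetchall()

# Even though B.id has duplicates, the MAX() subquery picks at most
# one B row per A row, so the result has exactly len(tableA) rows.
assert len(rows) == 2
```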
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is in the
lp:~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest either:
- is a constant table:
= because it can be accessed via eq_ref(const) access, or
= because it is a zero-rows or one-row MyISAM-like table [MARK1]
- or has an eq_ref access method candidate.
#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
GROUP BY and HAVING do not refer to the inner tables of the outer join
nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
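In bitmap terms this quick check is just a few OR operations. A minimal Python sketch (function and parameter names are illustrative, not taken from the server source):

```python
def may_have_eliminable_tables(select_list_map, group_order_map,
                               where_map, all_tables_map):
    """Quick pre-check: if some table is referenced nowhere outside
    the join conditions, it is a candidate for elimination and the
    (more expensive) join-nest enumeration is worth attempting."""
    used = select_list_map | group_order_map | where_map
    return used != all_tables_map

# Tables as bits: tableA = 0b01, tableB = 0b10.
# tableB appears only in the ON clause, so it is a candidate:
assert may_have_eliminable_tables(0b01, 0, 0b01, 0b11) is True
# Both tables referenced in the select list: nothing to try:
assert may_have_eliminable_tables(0b11, 0, 0b11, 0b11) is False
```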
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to front like it is done with const
tables, with exception that if eliminated outer join nest was within
another outer join nest, that shouldn't prevent us from moving away the
eliminated tables.
* Update join->table_count and all-join-tables bitmap.
* That's it. Nothing else?
4. User interface
-----------------
* We'll add an @@optimizer switch flag for table elimination. Tentative
name: 'table_elimination'.
(Note: the utility of the above is questioned, as table elimination can never
be worse than no elimination. We're leaning towards not adding the flag.)
* EXPLAIN will not show the removed tables at all. This makes it possible to
check whether tables were removed, and also behaves nicely with the anchor
model and VIEWs: stuff the user doesn't care about simply won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require
reworking operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables of which some are 1-row
MyISAM tables that do not have primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it, now aggregate function reports it depends on
tables that its arguments depend on. In particular, COUNT(*) reports
that it depends on no tables (item_count_star->used_tables()==0).
One consequence of that is that "item->used_tables()==0" is not
equivalent to "item->const_item()==true" anymore (not sure if it's
"anymore" or this has been already happening).
* EXPLAIN EXTENDED warning text was generated after the JOIN object has
been discarded. This made it impossible to use information about the join
plan when printing the warning. Fixed this by keeping the JOIN objects until
we've printed the warning (have also an intent to remove the const
tables from the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let EXPLAIN code check if the
SELECT was eliminated before the printing (EXPLAIN is generated by doing
a recursive descent, so the check will also cause children of eliminated
selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we read
one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because condition "B.fromDate= func(tableB)" cannot be used.
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access)
so they are not put into KEYUSE array.
Reason#2: even if they were put there, we would need to be able to
distinguish between predicates like
B.fromDate= func(B.id) // guarantees only one matching row as
// B.id is already bound by B.id=A.id
// hence B.fromDate becomes bound too.
and
"B.fromDate= func(B.*)" // Can potentially have many matching
// records.
We need to
- Have update_ref_and_keys() create KEYUSE elements for such equalities
- Have eliminate_tables() and friends make a more accurate check.
The right check is to check whether all parts of a unique key are bound.
If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
keypartY to be bound.
The difficulty here is that correlated subquery predicate cannot tell what
columns it depends on (it only remembers tables).
Traversing the predicate is expensive and complicated.
We're leaning towards making each subquery predicate have a List<Item> with
items that
- are in the current select
- and it depends on.
This list will be useful in certain other subquery optimizations as well,
it is cheap to collect it in fix_fields() phase, so it will be collected
for every subquery predicate.
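The "all parts of a unique key are bound" check described above is a small fixpoint computation. A Python sketch under an assumed representation — a dict mapping each key part to the set of sibling key parts its binding expression depends on; nothing here mirrors actual server data structures:

```python
def unique_key_fully_bound(key_parts, deps):
    """Return True if every part of the unique key becomes bound.

    deps[p] is the set of sibling key parts that p's binding
    expression depends on; an empty set means p is bound directly
    by outer-table values (e.g. B.id = A.id).  Iterate to a
    fixpoint: t.keypartY = func(keypartX) binds keypartY once
    keypartX is bound."""
    bound = set()
    changed = True
    while changed:
        changed = False
        for part in key_parts:
            if part not in bound and deps[part] <= bound:
                bound.add(part)
                changed = True
    return bound == set(key_parts)

# B.fromDate = func(B.id): id is bound by B.id = A.id, which then
# binds fromDate, so the whole unique key (id, fromDate) is bound.
assert unique_key_fully_bound(["id", "fromDate"],
                              {"id": set(), "fromDate": {"id"}})
# B.fromDate = func(B.*): depends on fromDate itself, never bound.
assert not unique_key_fully_bound(["id", "fromDate"],
                                  {"id": set(), "fromDate": {"id", "fromDate"}})
```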
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 16 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24138 2009-07-17 02:44:49.000000000 +0300
+++ /tmp/wklog.17.new.24138 2009-07-17 02:44:49.000000000 +0300
@@ -1 +1 @@
-9.x
+Server-9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Guest - Wed, 10 Jun 2009, 01:23)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.1842 2009-06-10 01:23:42.000000000 +0300
+++ /tmp/wklog.17.new.1842 2009-06-10 01:23:42.000000000 +0300
@@ -131,6 +131,11 @@
- this is what happens for constant tables, too.
- I don't see how showing them could be of any use. They only make it
harder to read the rewritten query.
+ It turns out that
+ - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
+ lifetime) changes.
+ - it is hard to have it show per-execution data. This is because the warning
+ text is generated after the execution structures have been destroyed.
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Wed, 03 Jun 2009, 22:01)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.21801 2009-06-03 22:01:34.000000000 +0300
+++ /tmp/wklog.17.new.21801 2009-06-03 22:01:34.000000000 +0300
@@ -1,3 +1,6 @@
+The code (currently in development) is at lp:
+~maria-captains/maria/maria-5.1-table-elimination tree.
+
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
-=-=(Guest - Wed, 03 Jun 2009, 15:04)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.20378 2009-06-03 15:04:54.000000000 +0300
+++ /tmp/wklog.17.new.20378 2009-06-03 15:04:54.000000000 +0300
@@ -135,3 +135,8 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
------------------------------------------------------------
-=-=(View All Progress Notes, 22 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE B (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value, and the unmatched B.* columns will all be NULL.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example (and most often
in practice) a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is in the
lp:~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables, it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest either:
- is a constant table:
= because it can be accessed via eq_ref(const) access, or
= because it is a zero-rows or one-row MyISAM-like table [MARK1]
- or has an eq_ref access method candidate.
#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
GROUP BY and HAVING do not refer to the inner tables of the outer join
nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to front like it is done with const
tables, with exception that if eliminated outer join nest was within
another outer join nest, that shouldn't prevent us from moving away the
eliminated tables.
* Update join->table_count and all-join-tables bitmap.
* That's it. Nothing else?
4. User interface
-----------------
* We'll add an @@optimizer switch flag for table elimination. Tentative
name: 'table_elimination'.
(Note: the utility of the above is questioned, as table elimination can never
be worse than no elimination. We're leaning towards not adding the flag.)
* EXPLAIN will not show the removed tables at all. This makes it possible to
check whether tables were removed, and also behaves nicely with the anchor
model and VIEWs: stuff the user doesn't care about simply won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require
reworking operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables of which some are 1-row
MyISAM tables that do not have primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. Fixed it, now aggregate function reports it depends on
tables that its arguments depend on. In particular, COUNT(*) reports
that it depends on no tables (item_count_star->used_tables()==0).
One consequence of that is that "item->used_tables()==0" is not
equivalent to "item->const_item()==true" anymore (not sure if it's
"anymore" or this has been already happening).
* EXPLAIN EXTENDED warning text was generated after the JOIN object has
been discarded. This made it impossible to use information about the join
plan when printing the warning. Fixed this by keeping the JOIN objects until
we've printed the warning (have also an intent to remove the const
tables from the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let EXPLAIN code check if the
SELECT was eliminated before the printing (EXPLAIN is generated by doing
a recursive descent, so the check will also cause children of eliminated
selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
From the user/EXPLAIN point of view: no. A constant table is one that we read
one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because the condition "B.fromDate= func(tableB)" cannot be used.
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access)
so they are not put into KEYUSE array.
Reason#2: even if they were put there, we would need to be able to
distinguish between predicates like
B.fromDate= func(B.id) // guarantees only one matching row as
// B.id is already bound by B.id=A.id
// hence B.fromDate becomes bound too.
and
"B.fromDate= func(B.*)" // Can potentially have many matching
// records.
We need to
- Have update_ref_and_keys() create KEYUSE elements for such equalities
- Have eliminate_tables() and friends make a more accurate check.
The right check is to check whether all parts of a unique key are bound.
If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
keypartY to be bound.
The difficulty here is that a correlated subquery predicate cannot tell which
columns it depends on (it only remembers tables).
Traversing the predicate is expensive and complicated.
We're leaning towards giving each subquery predicate a List<Item> of items
that
- are in the current select
- and that it depends on.
This list will be useful in certain other subquery optimizations as well; it
is cheap to collect in the fix_fields() phase, so it will be collected
for every subquery predicate.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Guest): Table elimination (17)
by worklog-noreply@askmonty.org 16 Jul '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Table elimination
CREATION DATE..: Sun, 10 May 2009, 19:57
SUPERVISOR.....: Monty
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 17 (http://askmonty.org/worklog/?tid=17)
VERSION........: 9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Version updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-5.1
+9.x
-=-=(Guest - Fri, 17 Jul 2009, 02:44)=-=-
Category updated.
--- /tmp/wklog.17.old.24114 2009-07-17 02:44:36.000000000 +0300
+++ /tmp/wklog.17.new.24114 2009-07-17 02:44:36.000000000 +0300
@@ -1 +1 @@
-Server-Sprint
+Client-BackLog
-=-=(Guest - Thu, 18 Jun 2009, 04:15)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.29969 2009-06-18 04:15:23.000000000 +0300
+++ /tmp/wklog.17.new.29969 2009-06-18 04:15:23.000000000 +0300
@@ -158,3 +158,43 @@
from user/EXPLAIN point of view: no. constant table is the one that we read
one record from. eliminated table is the one that we don't acccess at all.
+* What is described above will not be able to eliminate this outer join
+ create unique index idx on tableB (id, fromDate);
+ ...
+ left outer join
+ tableB B
+ on
+ B.id = A.id
+ and
+ B.fromDate = (select max(sub.fromDate)
+ from tableB sub where sub.id = A.id);
+
+ This is because condition "B.fromDate= func(tableB)" cannot be used.
+ Reason#1: update_ref_and_keys() does not consider such conditions to
+ be of any use (and indeed they are not usable for ref access)
+ so they are not put into KEYUSE array.
+ Reason#2: even if they were put there, we would need to be able to tell
+ between predicates like
+ B.fromDate= func(B.id) // guarantees only one matching row as
+ // B.id is already bound by B.id=A.id
+ // hence B.fromDate becomes bound too.
+ and
+ "B.fromDate= func(B.*)" // Can potentially have many matching
+ // records.
+ We need to
+ - Have update_ref_and_keys() create KEYUSE elements for such equalities
+ - Have eliminate_tables() and friends make a more accurate check.
+ The right check is to check whether all parts of a unique key are bound.
+ If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
+ keypartY to be bound.
+ The difficulty here is that correlated subquery predicate cannot tell what
+ columns it depends on (it only remembers tables).
+ Traversing the predicate is expensive and complicated.
+ We're leaning towards making each subquery predicate have a List<Item> with
+ items that
+ - are in the current select
+ - and it depends on.
+ This list will be useful in certain other subquery optimizations as well,
+ it is cheap to collect it in fix_fields() phase, so it will be collected
+ for every subquery predicate.
+
-=-=(Guest - Thu, 18 Jun 2009, 02:48)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27792 2009-06-18 02:48:45.000000000 +0300
+++ /tmp/wklog.17.new.27792 2009-06-18 02:48:45.000000000 +0300
@@ -89,14 +89,14 @@
- queries that would use elimination
- queries that are very similar to one above (so that they would have same
QEP, execution cost, etc) but cannot use table elimination.
+then compare run times and make a conclusion about whether dbms supports table
+elimination.
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
-- Re-check how this works with equality propagation.
-
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
@@ -141,8 +141,13 @@
7. Additional issues
--------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
+* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
+ Yes. Current approach: when removing an outer join nest, walk the ON clause
+ and mark subselects as eliminated. Then let EXPLAIN code check if the
+ SELECT was eliminated before the printing (EXPLAIN is generated by doing
+ a recursive descent, so the check will also cause children of eliminated
+ selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Thu, 18 Jun 2009, 02:24)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.27162 2009-06-18 02:24:14.000000000 +0300
+++ /tmp/wklog.17.new.27162 2009-06-18 02:24:14.000000000 +0300
@@ -83,9 +83,12 @@
5. Tests and benchmarks
-----------------------
-Should create a benchmark in sql-bench which checks if the dbms has table
+Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
-TODO elaborate
+[According to Monty] Run
+ - queries that would use elimination
+ - queries that are very similar to one above (so that they would have same
+ QEP, execution cost, etc) but cannot use table elimination.
6. Todo, issues to resolve
--------------------------
@@ -109,33 +112,37 @@
6.2 Resolved
~~~~~~~~~~~~
-- outer->inner join conversion is not a problem for table elimination.
+* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
-7. Additional issues
---------------------
-* We remove ON clauses within semi-join nests. If these clauses contain
- subqueries, they probably should be gone from EXPLAIN output also?
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-* Aggregate functions report they depend on all tables, that is,
+* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
- always. If we want table elimination to work in presence of grouping, need
- to devise some other way of analyzing aggregate functions.
+ always. Fixed it, now aggregate function reports it depends on
+ tables that its arguments depend on. In particular, COUNT(*) reports
+ that it depends on no tables (item_count_star->used_tables()==0).
+ One consequence of that is that "item->used_tables()==0" is not
+ equivalent to "item->const_item()==true" anymore (not sure if it's
+ "anymore" or this has been already happening).
+
+* EXPLAIN EXTENDED warning text was generated after the JOIN object has
+ been discarded. This didn't allow to use information about join plan
+ when printing the warning. Fixed this by keeping the JOIN objects until
+ we've printed the warning (have also an intent to remove the const
+ tables from the join output).
-* Should eliminated tables be shown in EXPLAIN EXTENDED?
- - If we just ignore the question, they will be shown
- - this is what happens for constant tables, too.
- - I don't see how showing them could be of any use. They only make it
- harder to read the rewritten query.
- It turns out that
- - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
- lifetime) changes.
- - it is hard to have it show per-execution data. This is because the warning
- text is generated after the execution structures have been destroyed.
+7. Additional issues
+--------------------
+* We remove ON clauses within semi-join nests. If these clauses contain
+ subqueries, they probably should be gone from EXPLAIN output also?
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
@@ -143,8 +150,6 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+ from user/EXPLAIN point of view: no. constant table is the one that we read
+ one record from. eliminated table is the one that we don't acccess at all.
-* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- - affected tables must not be eliminated
- - tables that are used on the right side of the SET x=y assignments must
- not be eliminated either.
-=-=(Guest - Tue, 16 Jun 2009, 17:01)=-=-
Dependency deleted: 29 no longer depends on 17
-=-=(Guest - Wed, 10 Jun 2009, 01:23)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.1842 2009-06-10 01:23:42.000000000 +0300
+++ /tmp/wklog.17.new.1842 2009-06-10 01:23:42.000000000 +0300
@@ -131,6 +131,11 @@
- this is what happens for constant tables, too.
- I don't see how showing them could be of any use. They only make it
harder to read the rewritten query.
+ It turns out that
+ - it is easy to have EXPLAIN EXTENDED show permanent (once-per-statement
+ lifetime) changes.
+ - it is hard to have it show per-execution data. This is because the warning
+ text is generated after the execution structures have been destroyed.
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
-=-=(Guest - Wed, 03 Jun 2009, 22:01)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.21801 2009-06-03 22:01:34.000000000 +0300
+++ /tmp/wklog.17.new.21801 2009-06-03 22:01:34.000000000 +0300
@@ -1,3 +1,6 @@
+The code (currently in development) is at lp:
+~maria-captains/maria/maria-5.1-table-elimination tree.
+
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
-=-=(Guest - Wed, 03 Jun 2009, 15:04)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.20378 2009-06-03 15:04:54.000000000 +0300
+++ /tmp/wklog.17.new.20378 2009-06-03 15:04:54.000000000 +0300
@@ -135,3 +135,8 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-=-=(Psergey - Wed, 03 Jun 2009, 12:07)=-=-
Dependency created: 29 now depends on 17
------------------------------------------------------------
-=-=(View All Progress Notes, 21 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE tableB (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does, is to remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value, and the unmatched B.* columns will all be NULL.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB contains rows that make it possible to
place a unique constraint on the column B.id, for example, and as is often
the case, a primary key. In this situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables; it's about elimination of the
inner sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest either:
- is a constant table:
= because it can be accessed via eq_ref(const) access, or
= because it is a zero-row or one-row MyISAM-like table [MARK1]
- or has an eq_ref access method candidate.
#2 means that the WHERE clause, the ON clauses of embedding outer joins,
ORDER BY, GROUP BY and HAVING do not refer to the inner tables of the outer
join nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
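The quick check above can be sketched as executable code. This is an illustrative simulation (Python rather than the server's C++), with table sets represented as plain integer bitmaps; all names are ours, not the server's identifiers:

```python
def may_have_elimination_candidates(select_list_tables: int,
                                    group_order_tables: int,
                                    where_tables: int,
                                    all_tables: int) -> bool:
    """True if some table is referenced nowhere outside join conditions,
    i.e. table elimination is worth attempting at all."""
    used = select_list_tables | group_order_tables | where_tables
    return used != all_tables

# Three tables (bits 0..2); table 2 is used only inside an ON clause:
print(may_have_elimination_candidates(0b001, 0b000, 0b011, 0b111))  # True
```

If every table appears in the select list, grouping/ordering, or WHERE, the bitmaps cover bitmap_of_all_tables and the (more expensive) join-nest enumeration is skipped entirely.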
2. Removal operation properties
-------------------------------
* There is always only one way to remove (no choice of removing either this
or that).
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, there is no need for any cost calculations etc.; it's an unconditional
rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to the front, like is done with const
tables, with the exception that if the eliminated outer join nest was
within another outer join nest, that shouldn't prevent us from moving the
eliminated tables away.
* Update join->table_count and the all-join-tables bitmap.
* That's it. Nothing else?
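The bitmap/count bookkeeping in the last step can be sketched as follows. A hedged illustration only (Python stand-in for the JOIN bookkeeping; names are ours):

```python
def remove_tables(all_tables_map: int, table_count: int,
                  eliminated_map: int):
    """Clear the eliminated tables out of the all-join-tables bitmap and
    adjust the table count accordingly (stand-in for updating
    join->table_count after elimination)."""
    # Every eliminated table must currently be part of the join:
    assert (eliminated_map & all_tables_map) == eliminated_map
    new_map = all_tables_map & ~eliminated_map
    new_count = table_count - bin(eliminated_map).count("1")
    return new_map, new_count

# Eliminate table 2 (bit 2) from a 3-table join:
print(remove_tables(0b111, 3, 0b100))  # (3, 2) i.e. bitmap 0b011, 2 tables
```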
4. User interface
-----------------
* We'll add an @@optimizer_switch flag for table elimination. Tentative
name: 'table_elimination'.
(Note: the utility of this flag is questionable, as table elimination can
never be worse than no elimination. We're leaning towards not adding it.)
* EXPLAIN will not show the removed tables at all. This makes it possible to
check whether tables were removed, and also behaves nicely with the anchor
model and VIEWs: things the user doesn't care about simply won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks whether the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to the ones above (so that they have the
same QEP, execution cost, etc.) but cannot use table elimination,
then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
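The comparison described could be harnessed roughly like this. A sketch under stated assumptions: `run_query` is a hypothetical callable that executes one benchmark query against the server; the 0.5 threshold is an arbitrary illustration, not sql-bench's actual criterion:

```python
import time

def median_runtime(run_query, repeats=5):
    """Time a query callable several times and take the median, to damp
    out per-run noise before comparing the two query classes."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

# The DBMS plausibly supports table elimination if the eliminable query
# runs markedly faster than its non-eliminable twin, e.g.:
#   supports = median_runtime(eliminable) < 0.5 * median_runtime(twin)
```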
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to want to make table elimination a
once-per-statement operation, like outer->inner join conversion. We would
have to limit its applicability by removing [MARK1], as that can change
during the lifetime of the statement.
The other option is to do table elimination every time. This would require
reworking operation [MARK2] to be undoable.
I'm leaning towards the former. With anchor modeling, it is unlikely that
we'll meet outer joins which have N inner tables of which some are one-row
MyISAM tables without a primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
always. This is now fixed: an aggregate function reports that it depends on
the tables its arguments depend on. In particular, COUNT(*) reports
that it depends on no tables (item_count_star->used_tables()==0).
One consequence is that "item->used_tables()==0" is no longer
equivalent to "item->const_item()==true" (it is unclear whether this is
new or was already the case).
* The EXPLAIN EXTENDED warning text was generated after the JOIN object had
been discarded. This made it impossible to use information about the join
plan when printing the warning. Fixed by keeping the JOIN objects until
we've printed the warning (we also intend to remove the const
tables from the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they should probably disappear from EXPLAIN output as well.
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark its subselects as eliminated. Then let the EXPLAIN code check
whether a SELECT was eliminated before printing it (EXPLAIN is generated by
recursive descent, so the check also prevents children of eliminated
selects from being printed).
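The recursive-descent check can be sketched in miniature. An illustration only (Python, dicts standing in for the server's SELECT_LEX tree; field names are ours):

```python
def print_explain(select, out):
    """Recursive descent over SELECTs; an eliminated select is skipped,
    and because recursion stops there, so are all of its children."""
    if select.get("eliminated"):
        return
    out.append(select["name"])
    for child in select.get("children", []):
        print_explain(child, out)

tree = {"name": "#1", "children": [
    {"name": "#2", "eliminated": True,
     "children": [{"name": "#3"}]}]}
lines = []
print_explain(tree, lines)
print(lines)  # ['#1'] - eliminated subquery #2 and its child #3 are gone
```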
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between a constant table and an eliminated one? If there is,
should we mark const tables as eliminated too?
From the user/EXPLAIN point of view: no. A constant table is one that we read
one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because the condition "B.fromDate= func(tableB)" cannot be used.
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access)
so they are not put into KEYUSE array.
Reason#2: even if they were put there, we would need to be able to
distinguish between predicates like
B.fromDate= func(B.id) // guarantees only one matching row as
// B.id is already bound by B.id=A.id
// hence B.fromDate becomes bound too.
and
"B.fromDate= func(B.*)" // Can potentially have many matching
// records.
We need to
- Have update_ref_and_keys() create KEYUSE elements for such equalities
- Have eliminate_tables() and friends make a more accurate check.
The right check is to check whether all parts of a unique key are bound.
If we have keypartX to be bound, then t.keypartY=func(keypartX) makes
keypartY to be bound.
The difficulty here is that a correlated subquery predicate cannot tell which
columns it depends on (it only remembers tables).
Traversing the predicate is expensive and complicated.
We're leaning towards giving each subquery predicate a List<Item> of items
that
- are in the current select
- and that it depends on.
This list will be useful in certain other subquery optimizations as well; it
is cheap to collect in the fix_fields() phase, so it will be collected
for every subquery predicate.
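The "all parts of a unique key are bound" check described above is a fixpoint computation: an equality keypartY=func(keypartX,...) binds Y once every X is bound. A hedged sketch (Python; not the eliminate_tables() code, and column names are from the worked example):

```python
def unique_key_fully_bound(key_parts, initially_bound, equalities):
    """key_parts: set of columns in the unique key.
    initially_bound: columns bound by outer-table refs or constants.
    equalities: list of (target_col, set_of_source_cols) pairs, meaning
    target_col = func(source_cols). Iterate to a fixpoint and report
    whether every key part ends up bound."""
    bound = set(initially_bound)
    changed = True
    while changed:
        changed = False
        for target, sources in equalities:
            if target not in bound and sources <= bound:
                bound.add(target)
                changed = True
    return key_parts <= bound

# Worked example: B.id is bound by B.id=A.id; then
# B.fromDate = func(B.id) binds fromDate, so key (id, fromDate) is bound.
print(unique_key_fully_bound({"id", "fromDate"}, {"id"},
                             [("fromDate", {"id"})]))  # True
```

A func(B.*) predicate would appear here as a source set containing an unbound column, and the fixpoint would correctly fail to bind fromDate.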
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Low Level Design modified.
--- /tmp/wklog.17.old.21801 2009-06-03 22:01:34.000000000 +0300
+++ /tmp/wklog.17.new.21801 2009-06-03 22:01:34.000000000 +0300
@@ -1,3 +1,6 @@
+The code (currently in development) is at lp:
+~maria-captains/maria/maria-5.1-table-elimination tree.
+
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
-=-=(Guest - Wed, 03 Jun 2009, 15:04)=-=-
Low Level Design modified.
--- /tmp/wklog.17.old.20378 2009-06-03 15:04:54.000000000 +0300
+++ /tmp/wklog.17.new.20378 2009-06-03 15:04:54.000000000 +0300
@@ -135,3 +135,8 @@
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
+
+* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
+ - affected tables must not be eliminated
+ - tables that are used on the right side of the SET x=y assignments must
+ not be eliminated either.
-=-=(Psergey - Wed, 03 Jun 2009, 12:07)=-=-
Dependency created: 29 now depends on 17
------------------------------------------------------------
-=-=(View All Progress Notes, 21 total)=-=-
http://askmonty.org/worklog/index.pl?tid=17&nolimit=1
DESCRIPTION:
Eliminate unneeded tables from SELECT queries.
This will speed up some views and automatically generated queries.
Example:
CREATE TABLE tableB (id int primary key);
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
In this case we can remove table B and the join from the query.
HIGH-LEVEL SPECIFICATION:
Here is an extended explanation of table elimination.
Table elimination is a feature found in some modern query optimizers, of
which Microsoft SQL Server 2005/2008 seems to have the most advanced
implementation. Oracle 11g has also been confirmed to use table
elimination but not to the same extent.
Basically, what table elimination does is remove tables from the
execution plan when it is unnecessary to include them. This can, of
course, only happen if the right circumstances arise. Let us for example
look at the following query:
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id;
When using A as the left table we ensure that the query will return at
least as many rows as there are in that table. For rows where the join
condition (B.id = A.id) is not met, the selected column (A.colA) will
still contain its original value, and the unmatched B.* columns will all be NULL.
However, the result set could actually contain more rows than what is
found in tableA if there are duplicates of the column B.id in tableB. If
A contains a row [1, "val1"] and B the rows [1, "other1a"],[1, "other1b"]
then two rows will match in the join condition. The only way to know
what the result will look like is to actually touch both tables during
execution.
Instead, let's say that tableB allows a unique constraint to be placed on
the column B.id, for example, and often in practice, a primary key. In this
situation we know that we will get exactly
as many rows as there are in tableA, since joining with tableB cannot
introduce any duplicates. If further, as in the example query, we do not
select any columns from tableB, touching that table during execution is
unnecessary. We can remove the whole join operation from the execution
plan.
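The reasoning above can be condensed into a small check. Here is an illustrative Python model (all names are invented for this sketch; this is not server code):

```python
def can_eliminate_left_join(bound_cols, unique_keys, outside_refs, inner_cols):
    """Decide whether the inner table of a LEFT JOIN can be eliminated.

    bound_cols:   inner-table columns bound by the ON clause (t.col = outer expr)
    unique_keys:  list of column sets, each a unique key of the inner table
    outside_refs: columns referenced anywhere else in the query
    inner_cols:   all columns of the inner table
    """
    # At most one inner row per outer row: some unique key is fully bound.
    at_most_one_match = any(set(key) <= set(bound_cols) for key in unique_keys)
    # No references to the inner table outside the ON clause.
    no_outside_refs = not (set(outside_refs) & set(inner_cols))
    return at_most_one_match and no_outside_refs

# The example query: B.id is a primary key, the ON clause binds B.id,
# and only A.colA is selected.
print(can_eliminate_left_join({"id"}, [{"id"}], {"colA"}, {"id", "colB"}))  # True
```

If any column of tableB were selected, the second condition would fail and the join would have to stay.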
Both SQL Server 2005/2008 and Oracle 11g will deploy table elimination
in the case described above. Let us look at a more advanced query, where
Oracle fails.
select
A.colA
from
tableA A
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (
select
max(sub.fromDate)
from
tableB sub
where
sub.id = A.id
);
In this example we have added another join condition, which ensures
that we only pick the matching row from tableB having the latest
fromDate. In this case tableB will contain duplicates of the column
B.id, so in order to ensure uniqueness the primary key has to contain
the fromDate column as well. In other words the primary key of tableB
is (B.id, B.fromDate).
Furthermore, since the subselect ensures that we only pick the latest
B.fromDate for a given B.id we know that at most one row will match
the join condition. We will again have the situation where joining
with tableB cannot affect the number of rows in the result set. Since
we do not select any columns from tableB, the whole join operation can
be eliminated from the execution plan.
SQL Server 2005/2008 will deploy table elimination in this situation as
well. We have not found a way to make Oracle 11g use it for this type of
query. Queries like these arise in two situations: either when you have a
denormalized model consisting of a fact table with several related
dimension tables, or when you have a highly normalized model where each
attribute is stored in its own table. The example with the subselect is
common whenever you store historized/versioned data.
LOW-LEVEL DESIGN:
The code (currently in development) is at lp:
~maria-captains/maria/maria-5.1-table-elimination tree.
<contents>
1. Conditions for removal
1.1 Quick check if there are candidates
2. Removal operation properties
3. Removal operation
4. User interface
5. Tests and benchmarks
6. Todo, issues to resolve
6.1 To resolve
6.2 Resolved
7. Additional issues
</contents>
It's not really about elimination of tables; it's about elimination of inner
sides of outer joins.
1. Conditions for removal
-------------------------
We can eliminate an inner side of outer join if:
1. For each record combination of outer tables, it will always produce
exactly one record.
2. There are no references to columns of the inner tables anywhere else in
the query.
#1 means that every table inside the outer join nest either:
 - is a constant table:
   = because it can be accessed via eq_ref(const) access, or
   = it is a zero-rows or one-row MyISAM-like table [MARK1]
 - or has an eq_ref access method candidate.
#2 means that WHERE clause, ON clauses of embedding outer joins, ORDER BY,
GROUP BY and HAVING do not refer to the inner tables of the outer join
nest.
1.1 Quick check if there are candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before we start to enumerate join nests, here is a quick way to check if
there *can be* something to be removed:
if ((tables used in select_list |
tables used in group/order by UNION |
tables used in where) != bitmap_of_all_tables)
{
attempt table elimination;
}
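The quick check above maps directly onto integer bitmaps; a minimal Python sketch (names are illustrative, not the server's):

```python
def may_have_eliminable_tables(select_map, group_order_map, where_map, n_tables):
    """Cheap prefilter: if every table is referenced by the select list,
    GROUP/ORDER BY, or WHERE, nothing can possibly be eliminated."""
    bitmap_of_all_tables = (1 << n_tables) - 1
    used = select_map | group_order_map | where_map
    return used != bitmap_of_all_tables

# Three tables; tables 0 and 1 are referenced, table 2 is a candidate.
print(may_have_eliminable_tables(0b001, 0b010, 0b001, 3))  # True
# All three tables referenced: skip the elimination attempt.
print(may_have_eliminable_tables(0b111, 0b000, 0b000, 3))  # False
```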
2. Removal operation properties
-------------------------------
* There is always one way to remove (no choice to remove either this or that)
* It is always better to remove as many tables as possible (at least within
our cost model).
Thus, no need for any cost calculations/etc. It's an unconditional rewrite.
3. Removal operation
--------------------
* Remove the outer join nest's nested join structure (i.e. get the
outer join's TABLE_LIST object $OJ and remove it from $OJ->embedding,
$OJ->embedding->nested_join. Update table_map's of all ancestor nested
joins). [MARK2]
* Move the tables and their JOIN_TABs to front like it is done with const
  tables, with the exception that if the eliminated outer join nest was within
another outer join nest, that shouldn't prevent us from moving away the
eliminated tables.
* Update join->table_count and all-join-tables bitmap.
* That's it. Nothing else?
4. User interface
-----------------
* We'll add an @@optimizer switch flag for table elimination. Tentative
name: 'table_elimination'.
  (Note: the utility of the above flag is questionable, as table elimination
  can never be worse than no elimination. We're leaning towards not adding the flag.)
* EXPLAIN will not show the removed tables at all. This makes it easy to check
if tables were removed, and also will behave nicely with anchor model and
VIEWs: stuff that user doesn't care about just won't be there.
5. Tests and benchmarks
-----------------------
Create a benchmark in sql-bench which checks if the DBMS has table
elimination.
[According to Monty] Run
- queries that would use elimination
- queries that are very similar to the ones above (so that they would have the
  same QEP, execution cost, etc.) but cannot use table elimination.
Then compare run times and draw a conclusion about whether the DBMS supports
table elimination.
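The run-time comparison can be sketched as a tiny timing harness. This is an illustrative Python model (run_query is a placeholder for executing one of the two query variants against the server), not sql-bench code:

```python
import time

def median_runtime(run_query, repeats=5):
    """Median wall-clock duration of `repeats` calls to run_query()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # placeholder: would send one query variant to the DBMS
        durations.append(time.perf_counter() - start)
    return sorted(durations)[len(durations) // 2]

# If the DBMS supports table elimination, the eliminable variant should run
# measurably faster than the near-identical non-eliminable one.
```

Using the median rather than the mean makes the comparison robust against one-off cache warmup effects.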
6. Todo, issues to resolve
--------------------------
6.1 To resolve
~~~~~~~~~~~~~~
- Relationship with prepared statements.
On one hand, it's natural to desire to make table elimination a
once-per-statement operation, like outer->inner join conversion. We'll have
to limit the applicability by removing [MARK1] as that can change during
lifetime of the statement.
The other option is to do table elimination every time. This will require
reworking operation [MARK2] to be undoable.
I'm leaning towards doing the former. With anchor modeling, it is unlikely
that we'll meet outer joins which have N inner tables of which some are 1-row
MyISAM tables that do not have a primary key.
6.2 Resolved
~~~~~~~~~~~~
* outer->inner join conversion is not a problem for table elimination.
We make outer->inner conversions based on predicates in WHERE. If the WHERE
referred to an inner table (requirement for OJ->IJ conversion) then table
elimination would not be applicable anyway.
* For Multi-table UPDATEs/DELETEs, need to also analyze the SET clause:
- affected tables must not be eliminated
- tables that are used on the right side of the SET x=y assignments must
not be eliminated either.
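The SET-clause rule above fits in a few lines; an illustrative Python sketch (names invented, not the server's analysis code):

```python
def elimination_candidates(all_tables, set_assignments):
    """set_assignments: one (updated_table, source_tables) pair per SET x=y.

    Tables that are updated, or that are read on the right-hand side of an
    assignment, must never be eliminated.
    """
    blocked = set()
    for updated_table, source_tables in set_assignments:
        blocked.add(updated_table)      # affected table must stay
        blocked.update(source_tables)   # tables read by the assignment must stay
    return set(all_tables) - blocked

# UPDATE t1, t2 LEFT JOIN t3 ... SET t1.a = t2.b: only t3 may be eliminated.
print(elimination_candidates({"t1", "t2", "t3"}, [("t1", ["t2"])]))  # {'t3'}
```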
* Aggregate functions used to report that they depend on all tables, that is,
item_agg_func->used_tables() == (1ULL << join->tables) - 1
  always. Fixed: an aggregate function now reports that it depends on the
  tables its arguments depend on. In particular, COUNT(*) reports
that it depends on no tables (item_count_star->used_tables()==0).
One consequence of that is that "item->used_tables()==0" is not
equivalent to "item->const_item()==true" anymore (not sure if it's
"anymore" or this has been already happening).
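The changed used_tables() behaviour can be modeled with bitmaps; a Python sketch of the idea (these classes are an illustration, not the server's Item hierarchy):

```python
class FieldItem:
    """A column reference; depends on exactly one table."""
    def __init__(self, table_bit):
        self.table_bit = table_bit
    def used_tables(self):
        return self.table_bit

class AggItem:
    """Post-fix behaviour: depends only on the tables its arguments use."""
    def __init__(self, args):
        self.args = args
    def used_tables(self):
        deps = 0
        for arg in self.args:
            deps |= arg.used_tables()  # union of argument dependencies
        return deps

count_star = AggItem([])                  # COUNT(*): no arguments
sum_of_t2 = AggItem([FieldItem(1 << 2)])  # an aggregate over table #2
print(count_star.used_tables())  # 0, yet COUNT(*) is not a const item
print(sum_of_t2.used_tables())   # 4
```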
* EXPLAIN EXTENDED warning text was generated after the JOIN object had
  been discarded. This made it impossible to use information about the join
  plan when printing the warning. Fixed by keeping the JOIN objects until
  we've printed the warning (we also intend to remove the const tables from
  the join output).
7. Additional issues
--------------------
* We remove ON clauses within outer join nests. If these clauses contain
subqueries, they probably should be gone from EXPLAIN output also?
Yes. Current approach: when removing an outer join nest, walk the ON clause
and mark subselects as eliminated. Then let EXPLAIN code check if the
SELECT was eliminated before the printing (EXPLAIN is generated by doing
a recursive descent, so the check will also cause children of eliminated
selects not to be printed)
* Table elimination is performed after constant table detection (but before
the range analysis). Constant tables are technically different from
eliminated ones (e.g. the former are shown in EXPLAIN and the latter aren't).
Considering we've already done the join_read_const_table() call, is there any
real difference between constant table and eliminated one? If there is, should
we mark const tables also as eliminated?
  From the user/EXPLAIN point of view: no. A constant table is one that we
  read one record from; an eliminated table is one that we don't access at all.
* What is described above will not be able to eliminate this outer join
create unique index idx on tableB (id, fromDate);
...
left outer join
tableB B
on
B.id = A.id
and
B.fromDate = (select max(sub.fromDate)
from tableB sub where sub.id = A.id);
This is because condition "B.fromDate= func(tableB)" cannot be used.
Reason#1: update_ref_and_keys() does not consider such conditions to
be of any use (and indeed they are not usable for ref access)
so they are not put into the KEYUSE array.
Reason#2: even if they were put there, we would need to be able to
distinguish between predicates like
B.fromDate= func(B.id) // guarantees only one matching row as
// B.id is already bound by B.id=A.id
// hence B.fromDate becomes bound too.
and
"B.fromDate= func(B.*)" // Can potentially have many matching
// records.
We need to
- Have update_ref_and_keys() create KEYUSE elements for such equalities
- Have eliminate_tables() and friends make a more accurate check.
The right check is whether all parts of a unique key are bound.
If keypartX is bound, then t.keypartY=func(keypartX) makes keypartY
bound as well.
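The "all key parts become bound" check is a transitive closure over such dependencies; an illustrative Python sketch (the representation is invented for the example):

```python
def unique_key_becomes_bound(keyparts, directly_bound, func_deps):
    """keyparts:       columns of one unique key
    directly_bound: columns bound by t.col = outer_expr equalities
    func_deps:      (determinant_cols, dependent_col) pairs, one per
                    predicate of the form t.dep = func(determinants, outer)
    """
    bound = set(directly_bound)
    changed = True
    while changed:  # propagate bindings to a fixpoint
        changed = False
        for determinants, dependent in func_deps:
            if dependent not in bound and set(determinants) <= bound:
                bound.add(dependent)
                changed = True
    return set(keyparts) <= bound

# B.fromDate = func(B.id): id is bound by B.id = A.id, so fromDate follows.
print(unique_key_becomes_bound({"id", "fromDate"}, {"id"},
                               [({"id"}, "fromDate")]))        # True
# B.fromDate = func(B.*): the determinants include an unbound column.
print(unique_key_becomes_bound({"id", "fromDate"}, {"id"},
                               [({"fromDate"}, "fromDate")]))  # False
```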
The difficulty here is that a correlated subquery predicate cannot tell what
columns it depends on (it only remembers tables).
Traversing the predicate is expensive and complicated.
We're leaning towards making each subquery predicate keep a List<Item> of
the items that
- are in the current select, and
- it depends on.
This list will be useful in certain other subquery optimizations as well;
it is cheap to collect in the fix_fields() phase, so it will be collected
for every subquery predicate.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Re: [Maria-developers] [Merge] lp:~maria-captains/maria/maria-xtradb into lp:maria
by Kristian Nielsen 10 Jul '09
Percona <launchpad(a)percona.com> writes:
> Percona has proposed merging lp:~maria-captains/maria/maria-xtradb into lp:maria.
>
> Requested reviews:
> Maria-captains (maria-captains)
>
> Proposal to merge replacement InnoDB->XtraDB
Thanks a lot for your efforts in this!
I branched the tree and took a look. There are a couple of issues that I think
need to be resolved before we can merge it into MariaDB. I have some questions
below, but please don't hesitate to ask me for any kind of help needed to move
this forward.
> === modified file 'storage/innobase/include/sync0rw.h'
> --- storage/innobase/include/sync0rw.h 2008-02-19 16:44:09 +0000
> +++ storage/innobase/include/sync0rw.h 2009-03-31 04:19:17 +0000
> +#ifndef INNODB_RW_LOCKS_USE_ATOMICS
> +#error INNODB_RW_LOCKS_USE_ATOMICS is not defined. Do you use enough new GCC or compatibles?
> +#error Or do you use exact options for CFLAGS?
> +#error e.g. (for x86_32): "-m32 -march=i586 -mtune=i686"
> +#error e.g. (for Sparc_64): "-m64 -mcpu=v9"
> +#error Otherwise, this build may be slower than normal version.
> +#endif
> +
My attempt to build (BUILD/compile-pentium64-max) failed with this
error. There were also other build errors (I assume places where the
atomics-using code has not been extended with a part that works without the
availability of atomics).
The reason is that in the MariaDB tree, HAVE_GCC_ATOMIC_BUILTINS is disabled,
which caused XtraDB to disable INNODB_RW_LOCKS_USE_ATOMICS, which triggers the
above error.
I think I understand why this makes sense for Percona; after all, using these
better synchronisation primitives is part of the reason for using the Percona
server in the first place.
Can you tell me if Percona has decided not to maintain XtraDB working without
the availability of atomic operations? Or if it is just an oversight?
I need to discuss with other MariaDB people whether XtraDB for MariaDB should
be maintained working without the atomic operations (if so, we should of
course be willing to do the work/effort required).
So, any thoughts about the best way to deal with this? Should the above #error
be removed and XtraDB extended to work without atomics in the MariaDB tree?
And is this something Percona wants to do, or should I look into it?
Also, Sergei Golubchik told me that HAVE_GCC_ATOMIC_BUILTINS is for
my_atomic_ops, and InnoDB/XtraDB shouldn't really be using it. But I need to
look more into the code to understand what the problem is, if any.
> === added directory 'storage/innobase/mysql-test'
> === added directory 'storage/innobase/mysql-test/patches'
When I ran the test suite, I got test failures in test main.innodb.
I see that the patch contains patches for main MySQL test cases in
mysql-test/t/*.test, and also seems to add separate test cases in the
storage/innobase/mysql-test/ directory.
Do you know what the status is of these test suite modifications? Do the
patches need to be applied to the existing test suite, and/or should the extra
test cases be used to add to/overwrite the existing tests?
We would need to get the test suite to run without problems before
merging. Does Percona run the test suite with no failures? Can you suggest
which directions I should work in to solve the test failures? I.e., I'm unsure
to what extent the extra test cases/patches have already been applied in main
MySQL sources, and whether failures are expected or are just due to not being
adapted for current MariaDB source changes.
Any help with the above would be great. I plan to continue working with you on
this so we can get it merged without unnecessary delays.
- Kristian.
So, I've been making a fair bit of changes around bits of the storage
engine API (by all accounts for the better) in Drizzle.
The idea being to move the handler to be a cursor on a table, with
actions not pertaining to that residing in StorageEngine (e.g. DDL).
There's also the (now rather old) change to drop table return code.
The next thing that will move into the StorageEngine is metadata
handling with the engine being able to be responsible for its own
(table) metadata.
This is well and truly increasing the differences between MySQL/MariaDB
and Drizzle in this area of code - increasing the work needed to port an
engine (either way).
I would guess it makes little sense for MySQL and MariaDB to diverge
here, although I have been (and continue to be) okay with Drizzle
diverging.
So, is there somebody interested in working with me to have the
MySQL/MariaDB API evolve in the same way?
--
Stewart Smith
[Maria-developers] Rev 2819: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-next-fix-subq/
by Sergey Petrunya 08 Jul '09
At file:///home/psergey/dev/mysql-next-fix-subq/
------------------------------------------------------------
revno: 2819
revision-id: psergey(a)askmonty.org-20090708174703-dz9uf5b0m6pcvtl6
parent: psergey(a)askmonty.org-20090708095341-9i08n2r8igulpxzz
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-fix-subq
timestamp: Wed 2009-07-08 21:47:03 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
Make the fix work with prepared statements:
- in previous cset changed calloc to alloc, forgot to add bzero.
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-07-08 09:53:41 +0000
+++ b/sql/item_subselect.cc 2009-07-08 17:47:03 +0000
@@ -180,10 +180,11 @@
if (!ancestor_used_tables)
{
set_depth();
- if (!(ancestor_used_tables=
- (table_map*)alloc_root(thd->stmt_arena->mem_root,
- (1+depth)*sizeof(table_map))))
+ size_t size= (1+depth) * sizeof(table_map);
+ if (!(ancestor_used_tables= (table_map*)
+ alloc_root(thd->stmt_arena->mem_root, size)))
return TRUE;
+ bzero(ancestor_used_tables, size);
furthest_correlated_ancestor= 0;
inside_first_fix_fields= TRUE;
}
@@ -258,7 +259,7 @@
is_correlated= TRUE;
furthest_correlated_ancestor= max(furthest_correlated_ancestor, n_levels);
if (n_levels > 1)
- ancestor_used_tables[n_levels - 2]= dep_map;
+ ancestor_used_tables[n_levels - 2] |= dep_map;
}
}
[Maria-developers] Rev 2727: MWL#17: Table elimination in file:///home/psergey/dev/maria-5.1-table-elim/
by Sergey Petrunya 08 Jul '09
At file:///home/psergey/dev/maria-5.1-table-elim/
------------------------------------------------------------
revno: 2727
revision-id: psergey(a)askmonty.org-20090708171038-9nyc3hcg1o7h8635
parent: psergey(a)askmonty.org-20090630132018-8qwou8bqiq5z1qjg
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: maria-5.1-table-elim
timestamp: Wed 2009-07-08 21:10:38 +0400
message:
MWL#17: Table elimination
- When collecting Item_subselect::refers_to, put references to the correct
subselect entry.
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2009-06-22 11:46:31 +0000
+++ b/sql/sql_lex.cc 2009-07-08 17:10:38 +0000
@@ -1780,6 +1780,7 @@
void st_select_lex::mark_as_dependent(st_select_lex *last, Item *dependency)
{
+ SELECT_LEX *next_to_last;
/*
Mark all selects from resolved to 1 before select where was
found table as depended (of select where was found table)
@@ -1787,6 +1788,7 @@
for (SELECT_LEX *s= this;
s && s != last;
s= s->outer_select())
+ {
if (!(s->uncacheable & UNCACHEABLE_DEPENDENT))
{
// Select is dependent of outer select
@@ -1802,10 +1804,12 @@
sl->uncacheable|= UNCACHEABLE_UNITED;
}
}
+ next_to_last= s;
+ }
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
if (dependency)
- this->master_unit()->item->refers_to.push_back(dependency);
+ next_to_last->master_unit()->item->refers_to.push_back(dependency);
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
[Maria-developers] Rev 2818: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-next-fix-subq/
by Sergey Petrunya 08 Jul '09
At file:///home/psergey/dev/mysql-next-fix-subq/
------------------------------------------------------------
revno: 2818
revision-id: psergey(a)askmonty.org-20090708095341-9i08n2r8igulpxzz
parent: psergey(a)askmonty.org-20090706143329-72s3e73rov2f5tml
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-fix-subq
timestamp: Wed 2009-07-08 13:53:41 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
Make the fix work with prepared statements:
- collect/save ancestor_used_tables and furthest_correlated_ancestor only at
PREPARE phase (at execute() we are unable to tell what table_map the outer
reference would tell. since it would be the same anyway, we save it at
PREPARE phase)
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-07-06 14:26:03 +0000
+++ b/sql/item_subselect.cc 2009-07-08 09:53:41 +0000
@@ -39,8 +39,8 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1), inside_fix_fields(0), engine_changed(0), changed(0),
- is_correlated(FALSE)
+ const_item_cache(1), inside_first_fix_fields(0), ancestor_used_tables(0),
+ engine_changed(0), changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -158,6 +158,7 @@
DBUG_RETURN(RES_OK);
}
+
void Item_subselect::set_depth()
{
uint n= 0;
@@ -166,6 +167,7 @@
this->depth= n - 1;
}
+
bool Item_subselect::fix_fields(THD *thd_param, Item **ref)
{
char const *save_where= thd_param->where;
@@ -175,23 +177,25 @@
DBUG_ASSERT(fixed == 0);
engine->set_thd((thd= thd_param));
- if (!inside_fix_fields)
+ if (!ancestor_used_tables)
{
set_depth();
- if (!(ancestor_used_tables= (table_map*)thd->calloc((1+depth) *
- sizeof(table_map))))
+ if (!(ancestor_used_tables=
+ (table_map*)alloc_root(thd->stmt_arena->mem_root,
+ (1+depth)*sizeof(table_map))))
return TRUE;
furthest_correlated_ancestor= 0;
+ inside_first_fix_fields= TRUE;
}
if (check_stack_overrun(thd, STACK_MIN_SIZE, (uchar*)&res))
return TRUE;
- inside_fix_fields++;
res= engine->prepare();
-
+
// all transformation is done (used by prepared statements)
changed= 1;
+ inside_first_fix_fields= FALSE;
if (!res)
{
@@ -220,14 +224,12 @@
if (!(*ref)->fixed)
ret= (*ref)->fix_fields(thd, ref);
thd->where= save_where;
- inside_fix_fields--;
return ret;
}
// Is it one field subselect?
if (engine->cols() > max_columns)
{
my_error(ER_OPERAND_COLUMNS, MYF(0), 1);
- inside_fix_fields--;
return TRUE;
}
fix_length_and_dec();
@@ -244,12 +246,23 @@
fixed= 1;
err:
- inside_fix_fields--;
thd->where= save_where;
return res;
}
+void Item_subselect::mark_as_dependent(uint n_levels, table_map dep_map)
+{
+ if (inside_first_fix_fields)
+ {
+ is_correlated= TRUE;
+ furthest_correlated_ancestor= max(furthest_correlated_ancestor, n_levels);
+ if (n_levels > 1)
+ ancestor_used_tables[n_levels - 2]= dep_map;
+ }
+}
+
+
/*
Adjust attributes after our parent select has been merged into grandparent
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2009-07-06 07:57:39 +0000
+++ b/sql/item_subselect.h 2009-07-08 09:53:41 +0000
@@ -68,7 +68,7 @@
/* cache of constant state */
bool const_item_cache;
- int inside_fix_fields;
+ int inside_first_fix_fields;
public:
/*
Depth of the subquery predicate.
@@ -140,6 +140,7 @@
return null_value;
}
bool fix_fields(THD *thd, Item **ref);
+ void mark_as_dependent(uint n_levels, table_map dep_map);
void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
Item **ref);
virtual bool exec();
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2009-07-06 07:57:39 +0000
+++ b/sql/sql_lex.cc 2009-07-08 09:53:41 +0000
@@ -1929,13 +1929,7 @@
}
Item_subselect *subquery_predicate= s->master_unit()->item;
if (subquery_predicate)
- {
- subquery_predicate->is_correlated= TRUE;
- subquery_predicate->furthest_correlated_ancestor=
- max(subquery_predicate->furthest_correlated_ancestor, n_levels);
- if (n_levels > 1)
- subquery_predicate->ancestor_used_tables[n_levels - 2]= dep_map;
- }
+ subquery_predicate->mark_as_dependent(n_levels, dep_map);
n_levels--;
}
}
07 Jul '09
Toby Thain has proposed merging lp:~qu1j0t3/maria/solaris10-port into lp:maria.
Requested reviews:
Kristian Nielsen (knielsen)
Added build scripts for 32 bit x86 architecture on Solaris. Renamed some scripts for consistency. Changed to dynamic linking of libgcc.
--
https://code.launchpad.net/~qu1j0t3/maria/solaris10-port/+merge/6999
Your team Maria developers is subscribed to branch lp:maria.
[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2716: Solaris 10 build script fixes by Toby Thain.
by noreply@launchpad.net 07 Jul '09
------------------------------------------------------------
revno: 2716
committer: knielsen(a)knielsen-hq.org
branch nick: mariadb-solaris10-port-merge
timestamp: Tue 2009-07-07 13:19:24 +0200
message:
Solaris 10 build script fixes by Toby Thain.
Added build scripts for 32 bit x86 architecture on Solaris.
Renamed some scripts for consistency.
Changed to dynamic linking of libgcc.
removed:
BUILD/compile-solaris-amd64-forte-debug
added:
BUILD/compile-solaris-amd64-debug-forte
BUILD/compile-solaris-x86-32
BUILD/compile-solaris-x86-32-debug
BUILD/compile-solaris-x86-32-debug-forte
BUILD/compile-solaris-x86-forte-32
modified:
BUILD/compile-solaris-amd64
BUILD/compile-solaris-amd64-debug
=== modified file 'BUILD/compile-solaris-amd64'
--- BUILD/compile-solaris-amd64 2009-05-09 04:01:53 +0000
+++ BUILD/compile-solaris-amd64 2009-07-07 11:19:24 +0000
@@ -26,7 +26,7 @@
extra_flags="$amd64_cflags -D__sun -m64 -mtune=athlon64"
extra_configs="$amd64_configs $max_configs --with-libevent"
-LDFLAGS="-lmtmalloc -static-libgcc"
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib/64"
export LDFLAGS
. "$path/FINISH.sh"
=== modified file 'BUILD/compile-solaris-amd64-debug'
--- BUILD/compile-solaris-amd64-debug 2009-05-09 04:01:53 +0000
+++ BUILD/compile-solaris-amd64-debug 2009-07-07 11:19:24 +0000
@@ -5,7 +5,7 @@
extra_flags="$amd64_cflags -D__sun -m64 -mtune=athlon64 $debug_cflags"
extra_configs="$amd64_configs $debug_configs $max_configs --with-libevent"
-LDFLAGS="-lmtmalloc -static-libgcc"
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib/64"
export LDFLAGS
. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-amd64-debug-forte'
--- BUILD/compile-solaris-amd64-debug-forte 1970-01-01 00:00:00 +0000
+++ BUILD/compile-solaris-amd64-debug-forte 2009-07-07 11:19:24 +0000
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+
+# Take only #define options - the others are gcc specific.
+# (real fix is for SETUP.sh not to put gcc specific options in $debug_cflags)
+DEFS=""
+for F in $debug_cflags ; do
+ expr "$F" : "^-D" && DEFS="$DEFS $F"
+done
+debug_cflags="-O0 -g $DEFS"
+
+extra_flags="-m64 -mt -D_FORTEC_ -xlibmopt -fns=no $debug_cflags"
+extra_configs="$max_configs --with-libevent $debug_configs"
+
+warnings=""
+c_warnings=""
+cxx_warnings=""
+base_cxxflags="-noex"
+
+CC=cc
+CFLAGS="-xstrconst"
+CXX=CC
+LDFLAGS="-lmtmalloc"
+
+. "$path/FINISH.sh"
=== removed file 'BUILD/compile-solaris-amd64-forte-debug'
--- BUILD/compile-solaris-amd64-forte-debug 2009-05-09 04:01:53 +0000
+++ BUILD/compile-solaris-amd64-forte-debug 1970-01-01 00:00:00 +0000
@@ -1,27 +0,0 @@
-#!/bin/sh
-
-path=`dirname $0`
-. "$path/SETUP.sh"
-
-# Take only #define options - the others are gcc specific.
-# (real fix is for SETUP.sh not to put gcc specific options in $debug_cflags)
-DEFS=""
-for F in $debug_cflags ; do
- expr "$F" : "^-D" && DEFS="$DEFS $F"
-done
-debug_cflags="-O0 -g $DEFS"
-
-extra_flags="-m64 -mt -D_FORTEC_ -xlibmopt -fns=no $debug_cflags"
-extra_configs="$max_configs --with-libevent $debug_configs"
-
-warnings=""
-c_warnings=""
-cxx_warnings=""
-base_cxxflags="-noex"
-
-CC=cc
-CFLAGS="-xstrconst"
-CXX=CC
-LDFLAGS="-lmtmalloc"
-
-. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-32'
--- BUILD/compile-solaris-x86-32 1970-01-01 00:00:00 +0000
+++ BUILD/compile-solaris-x86-32 2009-07-07 11:19:24 +0000
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+extra_flags="-D__sun -m32"
+extra_configs="$max_configs --with-libevent"
+
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib"
+export LDFLAGS
+
+. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-32-debug'
--- BUILD/compile-solaris-x86-32-debug 1970-01-01 00:00:00 +0000
+++ BUILD/compile-solaris-x86-32-debug 2009-07-07 11:19:24 +0000
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+extra_flags="-D__sun -m32 $debug_cflags"
+extra_configs="$max_configs --with-libevent $debug_configs"
+
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib"
+export LDFLAGS
+
+. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-32-debug-forte'
--- BUILD/compile-solaris-x86-32-debug-forte 1970-01-01 00:00:00 +0000
+++ BUILD/compile-solaris-x86-32-debug-forte 2009-07-07 11:19:24 +0000
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+
+# Take only #define options - the others are gcc specific.
+# (real fix is for SETUP.sh not to put gcc specific options in $debug_cflags)
+DEFS=""
+for F in $debug_cflags ; do
+ expr "$F" : "^-D" && DEFS="$DEFS $F"
+done
+debug_cflags="-O0 -g $DEFS"
+
+extra_flags="-m32 -mt -D_FORTEC_ -xbuiltin=%all -xlibmil -xlibmopt -fns=no -xprefetch=auto -xprefetch_level=3 $debug_cflags"
+extra_configs="$max_configs --with-libevent $debug_configs"
+
+warnings=""
+c_warnings=""
+cxx_warnings=""
+base_cxxflags="-noex"
+
+CC=cc
+CFLAGS="-xstrconst"
+CXX=CC
+LDFLAGS="-lmtmalloc"
+
+. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-forte-32'
--- BUILD/compile-solaris-x86-forte-32 1970-01-01 00:00:00 +0000
+++ BUILD/compile-solaris-x86-forte-32 2009-07-07 11:19:24 +0000
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+
+extra_flags="-m32 -mt -D_FORTEC_ -xbuiltin=%all -xlibmil -xlibmopt -fns=no -xprefetch=auto -xprefetch_level=3"
+extra_configs="$max_configs --with-libevent"
+
+warnings=""
+c_warnings=""
+cxx_warnings=""
+base_cxxflags="-noex"
+
+CC=cc
+CFLAGS="-xstrconst"
+CXX=CC
+LDFLAGS="-lmtmalloc"
+
+. "$path/FINISH.sh"
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1
Your team Maria developers is subscribed to branch lp:maria.

[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (knielsen:2716)
by knielsen@knielsen-hq.org 07 Jul '09
#At lp:maria
2716 knielsen(a)knielsen-hq.org 2009-07-07
Solaris 10 build script fixes by Toby Thain.
Added build scripts for the 32-bit x86 architecture on Solaris.
Renamed some scripts for consistency.
Changed to dynamic linking of libgcc.
removed:
BUILD/compile-solaris-amd64-forte-debug
added:
BUILD/compile-solaris-amd64-debug-forte
BUILD/compile-solaris-x86-32
BUILD/compile-solaris-x86-32-debug
BUILD/compile-solaris-x86-32-debug-forte
BUILD/compile-solaris-x86-forte-32
modified:
BUILD/compile-solaris-amd64
BUILD/compile-solaris-amd64-debug
per-file messages:
BUILD/compile-solaris-amd64
Changed to dynamic linking of libgcc.
The -static-libgcc flag was a legacy of the original build scripts. -R
(analogous to the -L link-time search path) is a Solaris mechanism that
ensures a needed library directory is searched at runtime.
In Solaris 10, gcc comes bundled under /usr/sfw, so it can be used without
creating dependency problems; the bundled copy also benefits from ordinary
system patch maintenance.
BUILD/compile-solaris-amd64-debug
Changed to dynamic linking of libgcc.
The -static-libgcc flag was a legacy of the original build scripts. -R
(analogous to the -L link-time search path) is a Solaris mechanism that
ensures a needed library directory is searched at runtime.
In Solaris 10, gcc comes bundled under /usr/sfw, so it can be used without
creating dependency problems; the bundled copy also benefits from ordinary
system patch maintenance.
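The change the message describes can be sketched as a before/after pair of LDFLAGS settings. The flag values are taken from the diffs in this commit; the comments are an editor's summary, not part of the original scripts:

```shell
# Before: libgcc was linked statically into every binary, freezing whatever
# libgcc version the build machine had.
OLD_LDFLAGS="-lmtmalloc -static-libgcc"

# After: libgcc_s is linked dynamically, and -R records /usr/sfw/lib/64 as a
# runpath in the binary, so the Solaris runtime linker finds the bundled gcc
# libraries without needing LD_LIBRARY_PATH.
NEW_LDFLAGS="-lmtmalloc -R/usr/sfw/lib/64"
export LDFLAGS="$NEW_LDFLAGS"
```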
=== modified file 'BUILD/compile-solaris-amd64'
--- a/BUILD/compile-solaris-amd64 2009-05-09 04:01:53 +0000
+++ b/BUILD/compile-solaris-amd64 2009-07-07 11:19:24 +0000
@@ -26,7 +26,7 @@ path=`dirname $0`
extra_flags="$amd64_cflags -D__sun -m64 -mtune=athlon64"
extra_configs="$amd64_configs $max_configs --with-libevent"
-LDFLAGS="-lmtmalloc -static-libgcc"
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib/64"
export LDFLAGS
. "$path/FINISH.sh"
=== modified file 'BUILD/compile-solaris-amd64-debug'
--- a/BUILD/compile-solaris-amd64-debug 2009-05-09 04:01:53 +0000
+++ b/BUILD/compile-solaris-amd64-debug 2009-07-07 11:19:24 +0000
@@ -5,7 +5,7 @@ path=`dirname $0`
extra_flags="$amd64_cflags -D__sun -m64 -mtune=athlon64 $debug_cflags"
extra_configs="$amd64_configs $debug_configs $max_configs --with-libevent"
-LDFLAGS="-lmtmalloc -static-libgcc"
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib/64"
export LDFLAGS
. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-amd64-debug-forte'
--- a/BUILD/compile-solaris-amd64-debug-forte 1970-01-01 00:00:00 +0000
+++ b/BUILD/compile-solaris-amd64-debug-forte 2009-07-07 11:19:24 +0000
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+
+# Take only #define options - the others are gcc specific.
+# (real fix is for SETUP.sh not to put gcc specific options in $debug_cflags)
+DEFS=""
+for F in $debug_cflags ; do
+ expr "$F" : "^-D" && DEFS="$DEFS $F"
+done
+debug_cflags="-O0 -g $DEFS"
+
+extra_flags="-m64 -mt -D_FORTEC_ -xlibmopt -fns=no $debug_cflags"
+extra_configs="$max_configs --with-libevent $debug_configs"
+
+warnings=""
+c_warnings=""
+cxx_warnings=""
+base_cxxflags="-noex"
+
+CC=cc
+CFLAGS="-xstrconst"
+CXX=CC
+LDFLAGS="-lmtmalloc"
+
+. "$path/FINISH.sh"
=== removed file 'BUILD/compile-solaris-amd64-forte-debug'
--- a/BUILD/compile-solaris-amd64-forte-debug 2009-05-09 04:01:53 +0000
+++ b/BUILD/compile-solaris-amd64-forte-debug 1970-01-01 00:00:00 +0000
@@ -1,27 +0,0 @@
-#!/bin/sh
-
-path=`dirname $0`
-. "$path/SETUP.sh"
-
-# Take only #define options - the others are gcc specific.
-# (real fix is for SETUP.sh not to put gcc specific options in $debug_cflags)
-DEFS=""
-for F in $debug_cflags ; do
- expr "$F" : "^-D" && DEFS="$DEFS $F"
-done
-debug_cflags="-O0 -g $DEFS"
-
-extra_flags="-m64 -mt -D_FORTEC_ -xlibmopt -fns=no $debug_cflags"
-extra_configs="$max_configs --with-libevent $debug_configs"
-
-warnings=""
-c_warnings=""
-cxx_warnings=""
-base_cxxflags="-noex"
-
-CC=cc
-CFLAGS="-xstrconst"
-CXX=CC
-LDFLAGS="-lmtmalloc"
-
-. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-32'
--- a/BUILD/compile-solaris-x86-32 1970-01-01 00:00:00 +0000
+++ b/BUILD/compile-solaris-x86-32 2009-07-07 11:19:24 +0000
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+extra_flags="-D__sun -m32"
+extra_configs="$max_configs --with-libevent"
+
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib"
+export LDFLAGS
+
+. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-32-debug'
--- a/BUILD/compile-solaris-x86-32-debug 1970-01-01 00:00:00 +0000
+++ b/BUILD/compile-solaris-x86-32-debug 2009-07-07 11:19:24 +0000
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+extra_flags="-D__sun -m32 $debug_cflags"
+extra_configs="$max_configs --with-libevent $debug_configs"
+
+LDFLAGS="-lmtmalloc -R/usr/sfw/lib"
+export LDFLAGS
+
+. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-32-debug-forte'
--- a/BUILD/compile-solaris-x86-32-debug-forte 1970-01-01 00:00:00 +0000
+++ b/BUILD/compile-solaris-x86-32-debug-forte 2009-07-07 11:19:24 +0000
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+
+# Take only #define options - the others are gcc specific.
+# (real fix is for SETUP.sh not to put gcc specific options in $debug_cflags)
+DEFS=""
+for F in $debug_cflags ; do
+ expr "$F" : "^-D" && DEFS="$DEFS $F"
+done
+debug_cflags="-O0 -g $DEFS"
+
+extra_flags="-m32 -mt -D_FORTEC_ -xbuiltin=%all -xlibmil -xlibmopt -fns=no -xprefetch=auto -xprefetch_level=3 $debug_cflags"
+extra_configs="$max_configs --with-libevent $debug_configs"
+
+warnings=""
+c_warnings=""
+cxx_warnings=""
+base_cxxflags="-noex"
+
+CC=cc
+CFLAGS="-xstrconst"
+CXX=CC
+LDFLAGS="-lmtmalloc"
+
+. "$path/FINISH.sh"
=== added file 'BUILD/compile-solaris-x86-forte-32'
--- a/BUILD/compile-solaris-x86-forte-32 1970-01-01 00:00:00 +0000
+++ b/BUILD/compile-solaris-x86-forte-32 2009-07-07 11:19:24 +0000
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+path=`dirname $0`
+. "$path/SETUP.sh"
+
+extra_flags="-m32 -mt -D_FORTEC_ -xbuiltin=%all -xlibmil -xlibmopt -fns=no -xprefetch=auto -xprefetch_level=3"
+extra_configs="$max_configs --with-libevent"
+
+warnings=""
+c_warnings=""
+cxx_warnings=""
+base_cxxflags="-noex"
+
+CC=cc
+CFLAGS="-xstrconst"
+CXX=CC
+LDFLAGS="-lmtmalloc"
+
+. "$path/FINISH.sh"

[Maria-developers] Rev 2817: BUG#42742: crash in setup_sj_materialization, Copy_field::set in file:///home/psergey/dev/mysql-next-fix-subq/
by Sergey Petrunya 06 Jul '09
At file:///home/psergey/dev/mysql-next-fix-subq/
------------------------------------------------------------
revno: 2817
revision-id: psergey(a)askmonty.org-20090706143329-72s3e73rov2f5tml
parent: psergey(a)askmonty.org-20090706142603-z3z8ku4fdah6ntwv
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-fix-subq
timestamp: Mon 2009-07-06 18:33:29 +0400
message:
BUG#42742: crash in setup_sj_materialization, Copy_field::set
- If a semi-join strategy covers a certain [first_table; last_table]
range in the join order, reset the sj_strategy member for all tables
within the range except the first one.
Failing to do so caused the EXPLAIN/execution code to try to apply two
strategies at once, which produced all kinds of undesired effects.
=== modified file 'mysql-test/r/subselect_sj2.result'
--- a/mysql-test/r/subselect_sj2.result 2009-03-21 15:31:38 +0000
+++ b/mysql-test/r/subselect_sj2.result 2009-07-06 14:33:29 +0000
@@ -689,3 +689,19 @@
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
drop table t1, t2;
+#
+# BUG#42742: crash in setup_sj_materialization, Copy_field::set
+#
+create table t3 ( c1 year) engine=innodb;
+insert into t3 values (2135),(2142);
+create table t2 (c1 tinytext,c2 text,c6 timestamp) engine=innodb;
+# The following must not crash, EXPLAIN should show one SJ strategy, not a mix:
+explain select 1 from t2 where
+c2 in (select 1 from t3, t2) and
+c1 in (select convert(c6,char(1)) from t2);
+id select_type table type possible_keys key key_len ref rows Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 1 Using where
+1 PRIMARY t2 ALL NULL NULL NULL NULL 1
+1 PRIMARY t2 ALL NULL NULL NULL NULL 1 Using where; Using join buffer
+1 PRIMARY t3 ALL NULL NULL NULL NULL 2 FirstMatch(t2); Using join buffer
+drop table t2, t3;
=== modified file 'mysql-test/r/subselect_sj2_jcl6.result'
--- a/mysql-test/r/subselect_sj2_jcl6.result 2009-06-19 09:12:06 +0000
+++ b/mysql-test/r/subselect_sj2_jcl6.result 2009-07-06 14:33:29 +0000
@@ -693,6 +693,22 @@
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
drop table t1, t2;
+#
+# BUG#42742: crash in setup_sj_materialization, Copy_field::set
+#
+create table t3 ( c1 year) engine=innodb;
+insert into t3 values (2135),(2142);
+create table t2 (c1 tinytext,c2 text,c6 timestamp) engine=innodb;
+# The following must not crash, EXPLAIN should show one SJ strategy, not a mix:
+explain select 1 from t2 where
+c2 in (select 1 from t3, t2) and
+c1 in (select convert(c6,char(1)) from t2);
+id select_type table type possible_keys key key_len ref rows Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 1 Using where
+1 PRIMARY t2 ALL NULL NULL NULL NULL 1 Using join buffer
+1 PRIMARY t2 ALL NULL NULL NULL NULL 1 Using where; Using join buffer
+1 PRIMARY t3 ALL NULL NULL NULL NULL 2 FirstMatch(t2); Using join buffer
+drop table t2, t3;
set join_cache_level=default;
show variables like 'join_cache_level';
Variable_name Value
=== modified file 'mysql-test/t/subselect_sj2.test'
--- a/mysql-test/t/subselect_sj2.test 2009-03-21 15:31:38 +0000
+++ b/mysql-test/t/subselect_sj2.test 2009-07-06 14:33:29 +0000
@@ -872,3 +872,15 @@
explain select 1 from t2 where c2 = any (select log10(null) from t1 where c6 <null) ;
drop table t1, t2;
+--echo #
+--echo # BUG#42742: crash in setup_sj_materialization, Copy_field::set
+--echo #
+create table t3 ( c1 year) engine=innodb;
+insert into t3 values (2135),(2142);
+create table t2 (c1 tinytext,c2 text,c6 timestamp) engine=innodb;
+-- echo # The following must not crash, EXPLAIN should show one SJ strategy, not a mix:
+explain select 1 from t2 where
+ c2 in (select 1 from t3, t2) and
+ c1 in (select convert(c6,char(1)) from t2);
+drop table t2, t3;
+
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-07-06 07:57:39 +0000
+++ b/sql/sql_select.cc 2009-07-06 14:33:29 +0000
@@ -7916,7 +7916,11 @@
uint i_end= first + join->best_positions[first].n_sj_tables;
for (uint i= first; i < i_end; i++)
+ {
+ if (i != first)
+ join->best_positions[i].sj_strategy= SJ_OPT_NONE;
handled_tabs |= join->best_positions[i].table->table->map;
+ }
if (tablenr != first)
pos->sj_strategy= SJ_OPT_NONE;

[Maria-developers] Rev 2816: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-next/
by Sergey Petrunya 06 Jul '09
At file:///home/psergey/dev/mysql-next/
------------------------------------------------------------
revno: 2816
revision-id: psergey(a)askmonty.org-20090706142603-z3z8ku4fdah6ntwv
parent: psergey(a)askmonty.org-20090706075739-ay9m392esf31wx0s
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next
timestamp: Mon 2009-07-06 18:26:03 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
- Post-push valgrind fix
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-07-06 07:57:39 +0000
+++ b/sql/item_subselect.cc 2009-07-06 14:26:03 +0000
@@ -289,8 +289,12 @@
used_tables_cache &= ~OUTER_REF_TABLE_BIT;
if (furthest_correlated_ancestor > 1)
used_tables_cache |= OUTER_REF_TABLE_BIT;
- const_item_cache &= test(!(used_tables_cache &
- ~new_parent->join->const_table_map));
+
+ /*
+ Don't update const_tables_cache yet as we don't yet know which of the
+ parent's tables are constant. Parent will call update_used_tables() anyway,
+ and that will be our chance to update.
+ */
}

[Maria-developers] Rev 2816: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-next-look-vg/
by Sergey Petrunya 06 Jul '09
At file:///home/psergey/dev/mysql-next-look-vg/
------------------------------------------------------------
revno: 2816
revision-id: psergey(a)askmonty.org-20090706141824-4u0m7arubaadks6w
parent: psergey(a)askmonty.org-20090706081826-4bvmp429ikj9aptw
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-look-vg
timestamp: Mon 2009-07-06 18:18:24 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
- Post-push valgrind fix
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-07-06 08:18:26 +0000
+++ b/sql/item_subselect.cc 2009-07-06 14:18:24 +0000
@@ -289,8 +289,12 @@
used_tables_cache &= ~OUTER_REF_TABLE_BIT;
if (furthest_correlated_ancestor > 1)
used_tables_cache |= OUTER_REF_TABLE_BIT;
- const_item_cache &= test(!(used_tables_cache &
- ~new_parent->join->const_table_map));
+
+ /*
+ Don't update const_tables_cache yet as we don't yet know which of the
+ parent's tables are constant. Parent will call update_used_tables() anyway,
+ and that will be our chance to update.
+ */
}

[Maria-developers] Rev 2815: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-next-fix-subq/
by Sergey Petrunya 06 Jul '09
At file:///home/psergey/dev/mysql-next-fix-subq/
------------------------------------------------------------
revno: 2815
revision-id: psergey(a)askmonty.org-20090706081826-4bvmp429ikj9aptw
parent: psergey(a)askmonty.org-20090704004450-4pqbx9pm50bzky0l
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-fix-subq
timestamp: Mon 2009-07-06 12:18:26 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2009-03-19 17:03:58 +0000
+++ b/mysql-test/r/subselect_sj.result 2009-07-06 08:18:26 +0000
@@ -327,3 +327,48 @@
HAVING X > '2012-12-12';
X
drop table t1, t2;
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2009-03-19 17:03:58 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2009-07-06 08:18:26 +0000
@@ -331,6 +331,51 @@
HAVING X > '2012-12-12';
X
drop table t1, t2;
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
set join_cache_level=default;
show variables like 'join_cache_level';
Variable_name Value
=== modified file 'mysql-test/t/subselect_sj.test'
--- a/mysql-test/t/subselect_sj.test 2009-03-19 17:03:58 +0000
+++ b/mysql-test/t/subselect_sj.test 2009-07-06 08:18:26 +0000
@@ -216,4 +216,39 @@
HAVING X > '2012-12-12';
drop table t1, t2;
-
+--echo #
+--echo # BUG#31480: Incorrect result for nested subquery when executed via semi join
+--echo #
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+show warnings;
+
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+
+drop table t1, t2, t3;
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item.cc 2009-07-06 08:18:26 +0000
@@ -2212,7 +2212,8 @@
}
-void Item_field::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_field::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
if (new_parent == depended_from)
depended_from= NULL;
@@ -3797,16 +3798,17 @@
static void mark_as_dependent(THD *thd, SELECT_LEX *last, SELECT_LEX *current,
Item_ident *resolved_item,
- Item_ident *mark_item)
+ Item_ident *mark_item, table_map dep_map)
{
const char *db_name= (resolved_item->db_name ?
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ //table_map dep_map = resolved_item->used_tables();
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
- current->mark_as_dependent(last);
+ current->mark_as_dependent(last, dep_map);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3864,21 +3866,26 @@
Item_subselect *prev_subselect_item=
previous_select->master_unit()->item;
Item_ident *dependent= resolved_item;
+ table_map found_used_tables;
if (found_field == view_ref_found)
{
Item::Type type= found_item->type();
+ found_used_tables= found_item->used_tables();
prev_subselect_item->used_tables_cache|=
- found_item->used_tables();
+ found_used_tables;
dependent= ((type == Item::REF_ITEM || type == Item::FIELD_ITEM) ?
(Item_ident*) found_item :
0);
}
else
+ {
+ found_used_tables= found_field->table->map;
prev_subselect_item->used_tables_cache|=
found_field->table->map;
+ }
prev_subselect_item->const_item_cache= 0;
mark_as_dependent(thd, last_select, current_sel, resolved_item,
- dependent);
+ dependent, found_used_tables);
}
}
@@ -4159,6 +4166,7 @@
SELECT_LEX *current_sel= (SELECT_LEX *) thd->lex->current_select;
Name_resolution_context *outer_context= 0;
SELECT_LEX *select= 0;
+ uint n_levels= 0;
/* Currently derived tables cannot be correlated */
if (current_sel->master_unit()->first_select()->linkage !=
DERIVED_TABLE_TYPE)
@@ -4251,7 +4259,8 @@
context->select_lex, this,
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
- (Item_ident*) (*reference) : 0));
+ (Item_ident*) (*reference) : 0),
+ (*from_field)->table->map);
return 0;
}
}
@@ -4266,7 +4275,8 @@
context->select_lex, this,
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
- 0));
+ 0),
+ (*reference)->used_tables());
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4300,6 +4310,7 @@
*/
prev_subselect_item->used_tables_cache|= OUTER_REF_TABLE_BIT;
prev_subselect_item->const_item_cache= 0;
+ n_levels++;
}
DBUG_ASSERT(ref != 0);
@@ -4367,14 +4378,15 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this,
- rf);
+ rf, rf->used_tables());
return 0;
}
else
{
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
- this, (Item_ident*)*reference);
+ this, (Item_ident*)*reference,
+ (*reference)->used_tables());
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -6084,7 +6096,8 @@
((refer_type == REF_ITEM ||
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
- 0));
+ 0),
+ (*reference)->used_tables());
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -6134,7 +6147,8 @@
goto error;
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
- thd->lex->current_select, this, fld);
+ thd->lex->current_select, this, fld,
+ from_field->table->map);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6157,7 +6171,8 @@
/* Should be checked in resolve_ref_in_select_and_group(). */
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
- context->select_lex, this, this);
+ context->select_lex, this, this,
+ (*ref)->used_tables());
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6568,20 +6583,22 @@
return err;
}
-void Item_outer_ref::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_outer_ref::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
if (depended_from == new_parent)
{
*ref= outer_ref;
- outer_ref->fix_after_pullout(new_parent, ref);
+ outer_ref->fix_after_pullout(new_parent, parent_tables, ref);
}
}
-void Item_ref::fix_after_pullout(st_select_lex *new_parent, Item **refptr)
+void Item_ref::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **refptr)
{
if (depended_from == new_parent)
{
- (*ref)->fix_after_pullout(new_parent, ref);
+ (*ref)->fix_after_pullout(new_parent, parent_tables, ref);
depended_from= NULL;
}
}
=== modified file 'sql/item.h'
--- a/sql/item.h 2009-05-25 10:10:18 +0000
+++ b/sql/item.h 2009-07-06 08:18:26 +0000
@@ -557,7 +557,8 @@
Fix after some tables has been pulled out. Basically re-calculate all
attributes that are dependent on the tables.
*/
- virtual void fix_after_pullout(st_select_lex *new_parent, Item **ref) {};
+ virtual void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref) {};
/*
should be used in case where we are sure that we do not need
@@ -1486,7 +1487,8 @@
bool send(Protocol *protocol, String *str_arg);
void reset_field(Field *f);
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
void make_field(Send_field *tmp_field);
int save_in_field(Field *field,bool no_conversions);
void save_org_in_field(Field *field);
@@ -2278,7 +2280,8 @@
bool send(Protocol *prot, String *tmp);
void make_field(Send_field *field);
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
int save_in_field(Field *field, bool no_conversions);
void save_org_in_field(Field *field);
enum Item_result result_type () const { return (*ref)->result_type(); }
@@ -2448,7 +2451,8 @@
outer_ref->save_org_in_field(result_field);
}
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
table_map used_tables() const
{
return (*ref)->const_item() ? 0 : OUTER_REF_TABLE_BIT;
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item_cmpfunc.cc 2009-07-06 08:18:26 +0000
@@ -4004,7 +4004,8 @@
}
-void Item_cond::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_cond::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
List_iterator<Item> li(list);
Item *item;
@@ -4018,7 +4019,7 @@
while ((item=li++))
{
table_map tmp_table_map;
- item->fix_after_pullout(new_parent, li.ref());
+ item->fix_after_pullout(new_parent, parent_tables, li.ref());
item= *li.ref();
used_tables_cache|= item->used_tables();
const_item_cache&= item->const_item();
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2009-01-26 16:03:39 +0000
+++ b/sql/item_cmpfunc.h 2009-07-06 08:18:26 +0000
@@ -1475,7 +1475,8 @@
bool add_at_head(Item *item) { return list.push_front(item); }
void add_at_head(List<Item> *nlist) { list.prepand(nlist); }
bool fix_fields(THD *, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
enum Type type() const { return COND_ITEM; }
List<Item>* argument_list() { return &list; }
=== modified file 'sql/item_func.cc'
--- a/sql/item_func.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item_func.cc 2009-07-06 08:18:26 +0000
@@ -206,7 +206,8 @@
}
-void Item_func::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_func::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
Item **arg,**arg_end;
@@ -217,7 +218,7 @@
{
for (arg=args, arg_end=args+arg_count; arg != arg_end ; arg++)
{
- (*arg)->fix_after_pullout(new_parent, arg);
+ (*arg)->fix_after_pullout(new_parent, parent_tables, arg);
Item *item= *arg;
used_tables_cache|= item->used_tables();
=== modified file 'sql/item_func.h'
--- a/sql/item_func.h 2009-05-21 20:27:17 +0000
+++ b/sql/item_func.h 2009-07-06 08:18:26 +0000
@@ -117,7 +117,8 @@
// Constructor used for Item_cond_and/or (see Item comment)
Item_func(THD *thd, Item_func *item);
bool fix_fields(THD *, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
table_map used_tables() const;
table_map not_null_tables() const;
void update_used_tables();
=== modified file 'sql/item_row.cc'
--- a/sql/item_row.cc 2008-02-22 11:11:25 +0000
+++ b/sql/item_row.cc 2009-07-06 08:18:26 +0000
@@ -124,13 +124,14 @@
}
}
-void Item_row::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_row::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
used_tables_cache= 0;
const_item_cache= 1;
for (uint i= 0; i < arg_count; i++)
{
- items[i]->fix_after_pullout(new_parent, &items[i]);
+ items[i]->fix_after_pullout(new_parent, parent_tables, &items[i]);
used_tables_cache|= items[i]->used_tables();
const_item_cache&= items[i]->const_item();
}
=== modified file 'sql/item_row.h'
--- a/sql/item_row.h 2008-02-22 11:11:25 +0000
+++ b/sql/item_row.h 2009-07-06 08:18:26 +0000
@@ -59,7 +59,8 @@
return 0;
};
bool fix_fields(THD *thd, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
void cleanup();
void split_sum_func(THD *thd, Item **ref_pointer_array, List<Item> &fields);
table_map used_tables() const { return used_tables_cache; };
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-06-30 08:03:05 +0000
+++ b/sql/item_subselect.cc 2009-07-06 08:18:26 +0000
@@ -39,7 +39,7 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1), engine_changed(0), changed(0),
+ const_item_cache(1), inside_fix_fields(0), engine_changed(0), changed(0),
is_correlated(FALSE)
{
with_subselect= 1;
@@ -158,6 +158,13 @@
DBUG_RETURN(RES_OK);
}
+void Item_subselect::set_depth()
+{
+ uint n= 0;
+ for (SELECT_LEX *s= unit->first_select(); s; s= s->outer_select())
+ n++;
+ this->depth= n - 1;
+}
bool Item_subselect::fix_fields(THD *thd_param, Item **ref)
{
@@ -168,9 +175,19 @@
DBUG_ASSERT(fixed == 0);
engine->set_thd((thd= thd_param));
+ if (!inside_fix_fields)
+ {
+ set_depth();
+ if (!(ancestor_used_tables= (table_map*)thd->calloc((1+depth) *
+ sizeof(table_map))))
+ return TRUE;
+ furthest_correlated_ancestor= 0;
+ }
+
if (check_stack_overrun(thd, STACK_MIN_SIZE, (uchar*)&res))
return TRUE;
+ inside_fix_fields++;
res= engine->prepare();
// all transformation is done (used by prepared statements)
@@ -203,12 +220,14 @@
if (!(*ref)->fixed)
ret= (*ref)->fix_fields(thd, ref);
thd->where= save_where;
+ inside_fix_fields--;
return ret;
}
// Is it one field subselect?
if (engine->cols() > max_columns)
{
my_error(ER_OPERAND_COLUMNS, MYF(0), 1);
+ inside_fix_fields--;
return TRUE;
}
fix_length_and_dec();
@@ -225,11 +244,56 @@
fixed= 1;
err:
+ inside_fix_fields--;
thd->where= save_where;
return res;
}
+/*
+ Adjust attributes after our parent select has been merged into grandparent
+
+ DESCRIPTION
+ Subquery is a composite object which may be correlated, that is, it may
+ have
+ 1. references to tables of the parent select (i.e. one that has the clause
+ with the subquery predicate)
+ 2. references to tables of the grandparent select
+ 3. references to tables of further ancestors.
+
+ Before the pullout, this item indicates:
+ - #1 with table bits in used_tables()
+ - #2 and #3 with OUTER_REF_TABLE_BIT.
+
+ After parent has been merged with grandparent:
+ - references to parent and grandparent tables should be indicated with
+ table bits.
+ - references to great-grandparent and further ancestors - with
+ OUTER_REF_TABLE_BIT.
+
+ This is exactly what this function does, based on pre-collected info in
+ ancestor_used_tables and furthest_correlated_ancestor.
+*/
+
+void Item_subselect::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
+{
+ used_tables_cache= (used_tables_cache << parent_tables) |
+ ancestor_used_tables[0];
+ for (uint i=0; i < depth; i++)
+ ancestor_used_tables[i]= ancestor_used_tables[i+1];
+ depth--;
+
+ if (furthest_correlated_ancestor)
+ furthest_correlated_ancestor--;
+ used_tables_cache &= ~OUTER_REF_TABLE_BIT;
+ if (furthest_correlated_ancestor > 1)
+ used_tables_cache |= OUTER_REF_TABLE_BIT;
+ const_item_cache &= test(!(used_tables_cache &
+ ~new_parent->join->const_table_map));
+}
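For readers following the bitmap arithmetic above: the shift-and-merge can be modelled in a standalone sketch. This is a simplified hypothetical model, not the actual MySQL types; `table_map` stands in for the server's table bitmap, and the real function additionally handles OUTER_REF_TABLE_BIT and const_item_cache as shown in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t table_map;   // stand-in for MySQL's table bitmap type

/*
  Simplified model of Item_subselect::fix_after_pullout().  When the
  parent select is merged into the grandparent, the old parent's tables
  are renumbered to come after the grandparent's parent_tables tables,
  so bits referring to them shift left by parent_tables; references to
  grandparent tables, pre-collected in ancestor_used_tables[0], become
  ordinary low-order table bits.
*/
struct SubqueryRefs
{
  table_map used_tables_cache;                  // refs to parent tables
  std::vector<table_map> ancestor_used_tables;  // refs to further ancestors
  unsigned depth;
  unsigned furthest_correlated_ancestor;

  void fix_after_pullout(unsigned parent_tables)
  {
    used_tables_cache= (used_tables_cache << parent_tables) |
                       ancestor_used_tables[0];
    // Every remaining ancestor moves one level closer.
    ancestor_used_tables.erase(ancestor_used_tables.begin());
    depth--;
    if (furthest_correlated_ancestor)
      furthest_correlated_ancestor--;
  }
};
```

For example, a reference to parent table 0 (bit 0x1) in a merged join whose new parent already had 3 tables becomes bit 0x8, and a stored grandparent reference 0x4 is OR-ed in directly.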
+
+
bool Item_subselect::walk(Item_processor processor, bool walk_subquery,
uchar *argument)
{
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2008-11-10 18:36:50 +0000
+++ b/sql/item_subselect.h 2009-07-06 08:18:26 +0000
@@ -66,9 +66,39 @@
/* work with 'substitution' */
bool have_to_be_excluded;
/* cache of constant state */
+
bool const_item_cache;
+ int inside_fix_fields;
+public:
+ /*
+ Depth of the subquery predicate.
+ If the subquery predicate is attached to some clause of the top-level
+ select, depth will be 1.
+ If it is attached to a clause in a subquery of the top-level select, depth
+ will be 2 and so forth.
+ */
+ uint depth;
+
+ /*
+ Maximum correlation level of the select
+ - select that has no references to outside will have 0,
+ - select that references tables of the select it is located in will have 1,
+ - select that has references to tables of its parent select will have 2,
+ - select that has references to tables of grandparent will have 3
+ and so forth.
+ */
+ uint furthest_correlated_ancestor;
+ /*
+ This is used_tables() for non-direct ancestors. That is,
+ - used_tables() shows which tables of the parent select are referred to
+ from within the subquery,
+ - ancestor_used_tables[0] shows which tables of the grandparent select are
+ referred to from within the subquery,
+ - ancestor_used_tables[1] shows which tables of the great-grandparent
+ select... and so forth.
+ */
+ table_map *ancestor_used_tables;
-public:
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -84,6 +114,7 @@
Item_subselect();
virtual subs_type substype() { return UNKNOWN_SUBS; }
+ void set_depth();
/*
We need this method, because some compilers do not allow 'this'
@@ -109,6 +140,8 @@
return null_value;
}
bool fix_fields(THD *thd, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
virtual bool exec();
virtual void fix_length_and_dec();
table_map used_tables() const;
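The depth bookkeeping the new members support can be sketched with a minimal stand-in for the select chain. This is a hypothetical standalone model of set_depth(), assuming only that each select can reach its enclosing select:

```cpp
#include <cassert>

// Minimal stand-in for the st_select_lex chain of enclosing selects.
struct SelectLex
{
  SelectLex *outer;
  SelectLex *outer_select() const { return outer; }
};

/*
  Mirrors Item_subselect::set_depth(): count the selects on the path
  from the subquery's own select up to the top level; subtracting one
  for the subquery's own select leaves the number of enclosing selects,
  so a subquery attached to the top-level select gets depth 1.
*/
unsigned depth_of(SelectLex *first_select)
{
  unsigned n= 0;
  for (SelectLex *s= first_select; s; s= s->outer_select())
    n++;
  return n - 1;
}
```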
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item_sum.cc 2009-07-06 08:18:26 +0000
@@ -350,7 +350,7 @@
sl= sl->master_unit()->outer_select() )
sl->master_unit()->item->with_sum_func= 1;
}
- thd->lex->current_select->mark_as_dependent(aggr_sel);
+ thd->lex->current_select->mark_as_dependent(aggr_sel, NULL);
return FALSE;
}
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2009-06-04 06:27:44 +0000
+++ b/sql/sql_lex.cc 2009-07-06 08:18:26 +0000
@@ -1901,8 +1901,9 @@
'last' should be reachable from this st_select_lex_node
*/
-void st_select_lex::mark_as_dependent(st_select_lex *last)
+void st_select_lex::mark_as_dependent(st_select_lex *last, table_map dep_map)
{
+ uint n_levels= master_unit()->item->depth;
/*
Mark all selects from resolved to 1 before select where was
found table as depended (of select where was found table)
@@ -1928,7 +1929,14 @@
}
Item_subselect *subquery_predicate= s->master_unit()->item;
if (subquery_predicate)
+ {
subquery_predicate->is_correlated= TRUE;
+ subquery_predicate->furthest_correlated_ancestor=
+ max(subquery_predicate->furthest_correlated_ancestor, n_levels);
+ if (n_levels > 1)
+ subquery_predicate->ancestor_used_tables[n_levels - 2]= dep_map;
+ }
+ n_levels--;
}
}
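A sketch of what the new dep_map argument lets mark_as_dependent() record, per enclosing subquery predicate. This is a simplified hypothetical model: the real function walks every select between the reference and the select that owns the table, decrementing n_levels at each step, and also sets is_correlated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint64_t table_map;

// Per-subquery correlation bookkeeping, as in Item_subselect.
struct SubqueryPredicate
{
  unsigned furthest_correlated_ancestor;
  std::vector<table_map> ancestor_used_tables;
};

/*
  Models the update performed for one enclosing subquery predicate:
  n_levels says how many selects up the referenced table lives, dep_map
  is its table map in that select.  References that skip more than one
  level are remembered per-ancestor, so fix_after_pullout() can later
  turn them into concrete table bits.
*/
void record_dependency(SubqueryPredicate *pred, unsigned n_levels,
                       table_map dep_map)
{
  pred->furthest_correlated_ancestor=
    std::max(pred->furthest_correlated_ancestor, n_levels);
  if (n_levels > 1)   // beyond the direct parent: remember exact tables
    pred->ancestor_used_tables[n_levels - 2]= dep_map;
}
```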
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2009-06-12 02:01:08 +0000
+++ b/sql/sql_lex.h 2009-07-06 08:18:26 +0000
@@ -755,7 +755,7 @@
return master_unit()->return_after_parsing();
}
- void mark_as_dependent(st_select_lex *last);
+ void mark_as_dependent(st_select_lex *last, table_map dep_map);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-07-04 00:44:50 +0000
+++ b/sql/sql_select.cc 2009-07-06 08:18:26 +0000
@@ -3122,16 +3122,23 @@
}
-void fix_list_after_tbl_changes(SELECT_LEX *new_parent, List<TABLE_LIST> *tlist)
+void fix_list_after_tbl_changes(SELECT_LEX *new_parent, uint parent_tables,
+ List<TABLE_LIST> *tlist)
{
List_iterator<TABLE_LIST> it(*tlist);
TABLE_LIST *table;
while ((table= it++))
{
if (table->on_expr)
- table->on_expr->fix_after_pullout(new_parent, &table->on_expr);
+ {
+ table->on_expr->fix_after_pullout(new_parent, parent_tables,
+ &table->on_expr);
+ }
if (table->nested_join)
- fix_list_after_tbl_changes(new_parent, &table->nested_join->join_list);
+ {
+ fix_list_after_tbl_changes(new_parent, parent_tables,
+ &table->nested_join->join_list);
+ }
}
}
@@ -3334,6 +3341,7 @@
/*TODO: also reset the 'with_subselect' there. */
/* n. Adjust the parent_join->tables counter */
+ uint parent_tables= parent_join->tables;
uint table_no= parent_join->tables;
/* n. Walk through child's tables and adjust table->map */
for (tl= subq_lex->leaf_tables; tl; tl= tl->next_leaf, table_no++)
@@ -3410,8 +3418,10 @@
Fix attributes (mainly item->table_map()) for sj-nest's WHERE and ON
expressions.
*/
- sj_nest->sj_on_expr->fix_after_pullout(parent_lex, &sj_nest->sj_on_expr);
- fix_list_after_tbl_changes(parent_lex, &sj_nest->nested_join->join_list);
+ sj_nest->sj_on_expr->fix_after_pullout(parent_lex, parent_join->tables,
+ &sj_nest->sj_on_expr);
+ fix_list_after_tbl_changes(parent_lex, parent_join->tables,
+ &sj_nest->nested_join->join_list);
/* Unlink the child select_lex so it doesn't show up in EXPLAIN: */

[Maria-developers] Rev 2815: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-next-fix-subq/
by Sergey Petrunya 06 Jul '09
At file:///home/psergey/dev/mysql-next-fix-subq/
------------------------------------------------------------
revno: 2815
revision-id: psergey(a)askmonty.org-20090706075739-ay9m392esf31wx0s
parent: psergey(a)askmonty.org-20090704004450-4pqbx9pm50bzky0l
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next-fix-subq
timestamp: Mon 2009-07-06 11:57:39 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2009-03-19 17:03:58 +0000
+++ b/mysql-test/r/subselect_sj.result 2009-07-06 07:57:39 +0000
@@ -327,3 +327,48 @@
HAVING X > '2012-12-12';
X
drop table t1, t2;
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
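The expected rows in the .result hunk above can be cross-checked without a server by evaluating the nested-subquery semantics brute-force. A throwaway verification sketch, not part of the patch:

```cpp
#include <cassert>
#include <utility>
#include <vector>

/*
  Brute-force check of the test query:
    select a from t1
    where a in (select c from t2 where d >= some(select e from t3 where b=e));
  t1.b is referenced from the innermost select, i.e. the correlation
  skips one level -- exactly the case BUG#31480 is about.
*/
std::vector<int> expected_rows()
{
  std::vector<std::pair<int,int> > t1= {{1,10},{2,10},{1,20},{2,20},
                                        {3,20},{2,30},{4,40}};
  std::vector<std::pair<int,int> > t2= {{2,10},{2,20},{4,10},
                                        {5,10},{3,20},{2,40}};
  std::vector<int> t3= {10,30,10,20};

  std::vector<int> result;
  for (auto &r1 : t1)                      // (a, b)
  {
    bool match= false;
    for (auto &r2 : t2)                    // (c, d)
    {
      if (r2.first != r1.first)            // a = c
        continue;
      for (int e : t3)                     // d >= some e where b = e
        if (r1.second == e && r2.second >= e)
          match= true;
    }
    if (match)                             // IN: one output row per t1 row
      result.push_back(r1.first);
  }
  return result;
}
```

Evaluating this yields the rows 2, 2, 3, 2, matching the result file.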
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2009-03-19 17:03:58 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2009-07-06 07:57:39 +0000
@@ -331,6 +331,51 @@
HAVING X > '2012-12-12';
X
drop table t1, t2;
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
set join_cache_level=default;
show variables like 'join_cache_level';
Variable_name Value
=== modified file 'mysql-test/t/subselect_sj.test'
--- a/mysql-test/t/subselect_sj.test 2009-03-19 17:03:58 +0000
+++ b/mysql-test/t/subselect_sj.test 2009-07-06 07:57:39 +0000
@@ -216,4 +216,39 @@
HAVING X > '2012-12-12';
drop table t1, t2;
-
+--echo #
+--echo # BUG#31480: Incorrect result for nested subquery when executed via semi join
+--echo #
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+show warnings;
+
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+
+drop table t1, t2, t3;
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item.cc 2009-07-06 07:57:39 +0000
@@ -2212,7 +2212,8 @@
}
-void Item_field::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_field::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
if (new_parent == depended_from)
depended_from= NULL;
@@ -3797,16 +3798,17 @@
static void mark_as_dependent(THD *thd, SELECT_LEX *last, SELECT_LEX *current,
Item_ident *resolved_item,
- Item_ident *mark_item)
+ Item_ident *mark_item, table_map dep_map)
{
const char *db_name= (resolved_item->db_name ?
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ //table_map dep_map = resolved_item->used_tables();
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
- current->mark_as_dependent(last);
+ current->mark_as_dependent(last, dep_map);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3864,21 +3866,26 @@
Item_subselect *prev_subselect_item=
previous_select->master_unit()->item;
Item_ident *dependent= resolved_item;
+ table_map found_used_tables;
if (found_field == view_ref_found)
{
Item::Type type= found_item->type();
+ found_used_tables= found_item->used_tables();
prev_subselect_item->used_tables_cache|=
- found_item->used_tables();
+ found_used_tables;
dependent= ((type == Item::REF_ITEM || type == Item::FIELD_ITEM) ?
(Item_ident*) found_item :
0);
}
else
+ {
+ found_used_tables= found_field->table->map;
prev_subselect_item->used_tables_cache|=
found_field->table->map;
+ }
prev_subselect_item->const_item_cache= 0;
mark_as_dependent(thd, last_select, current_sel, resolved_item,
- dependent);
+ dependent, found_used_tables);
}
}
@@ -4159,6 +4166,7 @@
SELECT_LEX *current_sel= (SELECT_LEX *) thd->lex->current_select;
Name_resolution_context *outer_context= 0;
SELECT_LEX *select= 0;
+ uint n_levels= 0;
/* Currently derived tables cannot be correlated */
if (current_sel->master_unit()->first_select()->linkage !=
DERIVED_TABLE_TYPE)
@@ -4251,7 +4259,8 @@
context->select_lex, this,
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
- (Item_ident*) (*reference) : 0));
+ (Item_ident*) (*reference) : 0),
+ (*from_field)->table->map);
return 0;
}
}
@@ -4266,7 +4275,8 @@
context->select_lex, this,
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
- 0));
+ 0),
+ (*reference)->used_tables());
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4300,6 +4310,7 @@
*/
prev_subselect_item->used_tables_cache|= OUTER_REF_TABLE_BIT;
prev_subselect_item->const_item_cache= 0;
+ n_levels++;
}
DBUG_ASSERT(ref != 0);
@@ -4367,14 +4378,15 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this,
- rf);
+ rf, rf->used_tables());
return 0;
}
else
{
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
- this, (Item_ident*)*reference);
+ this, (Item_ident*)*reference,
+ (*reference)->used_tables());
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -6084,7 +6096,8 @@
((refer_type == REF_ITEM ||
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
- 0));
+ 0),
+ (*reference)->used_tables());
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -6134,7 +6147,8 @@
goto error;
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
- thd->lex->current_select, this, fld);
+ thd->lex->current_select, this, fld,
+ from_field->table->map);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6157,7 +6171,8 @@
/* Should be checked in resolve_ref_in_select_and_group(). */
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
- context->select_lex, this, this);
+ context->select_lex, this, this,
+ (*ref)->used_tables());
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6568,20 +6583,22 @@
return err;
}
-void Item_outer_ref::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_outer_ref::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
if (depended_from == new_parent)
{
*ref= outer_ref;
- outer_ref->fix_after_pullout(new_parent, ref);
+ outer_ref->fix_after_pullout(new_parent, parent_tables, ref);
}
}
-void Item_ref::fix_after_pullout(st_select_lex *new_parent, Item **refptr)
+void Item_ref::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **refptr)
{
if (depended_from == new_parent)
{
- (*ref)->fix_after_pullout(new_parent, ref);
+ (*ref)->fix_after_pullout(new_parent, parent_tables, ref);
depended_from= NULL;
}
}
=== modified file 'sql/item.h'
--- a/sql/item.h 2009-05-25 10:10:18 +0000
+++ b/sql/item.h 2009-07-06 07:57:39 +0000
@@ -557,7 +557,8 @@
Fix after some tables has been pulled out. Basically re-calculate all
attributes that are dependent on the tables.
*/
- virtual void fix_after_pullout(st_select_lex *new_parent, Item **ref) {};
+ virtual void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref) {};
/*
should be used in case where we are sure that we do not need
@@ -1486,7 +1487,8 @@
bool send(Protocol *protocol, String *str_arg);
void reset_field(Field *f);
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
void make_field(Send_field *tmp_field);
int save_in_field(Field *field,bool no_conversions);
void save_org_in_field(Field *field);
@@ -2278,7 +2280,8 @@
bool send(Protocol *prot, String *tmp);
void make_field(Send_field *field);
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
int save_in_field(Field *field, bool no_conversions);
void save_org_in_field(Field *field);
enum Item_result result_type () const { return (*ref)->result_type(); }
@@ -2448,7 +2451,8 @@
outer_ref->save_org_in_field(result_field);
}
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
table_map used_tables() const
{
return (*ref)->const_item() ? 0 : OUTER_REF_TABLE_BIT;
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item_cmpfunc.cc 2009-07-06 07:57:39 +0000
@@ -4004,7 +4004,8 @@
}
-void Item_cond::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_cond::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
List_iterator<Item> li(list);
Item *item;
@@ -4018,7 +4019,7 @@
while ((item=li++))
{
table_map tmp_table_map;
- item->fix_after_pullout(new_parent, li.ref());
+ item->fix_after_pullout(new_parent, parent_tables, li.ref());
item= *li.ref();
used_tables_cache|= item->used_tables();
const_item_cache&= item->const_item();
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2009-01-26 16:03:39 +0000
+++ b/sql/item_cmpfunc.h 2009-07-06 07:57:39 +0000
@@ -1475,7 +1475,8 @@
bool add_at_head(Item *item) { return list.push_front(item); }
void add_at_head(List<Item> *nlist) { list.prepand(nlist); }
bool fix_fields(THD *, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
enum Type type() const { return COND_ITEM; }
List<Item>* argument_list() { return &list; }
=== modified file 'sql/item_func.cc'
--- a/sql/item_func.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item_func.cc 2009-07-06 07:57:39 +0000
@@ -206,7 +206,8 @@
}
-void Item_func::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_func::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
Item **arg,**arg_end;
@@ -217,7 +218,7 @@
{
for (arg=args, arg_end=args+arg_count; arg != arg_end ; arg++)
{
- (*arg)->fix_after_pullout(new_parent, arg);
+ (*arg)->fix_after_pullout(new_parent, parent_tables, arg);
Item *item= *arg;
used_tables_cache|= item->used_tables();
=== modified file 'sql/item_func.h'
--- a/sql/item_func.h 2009-05-21 20:27:17 +0000
+++ b/sql/item_func.h 2009-07-06 07:57:39 +0000
@@ -117,7 +117,8 @@
// Constructor used for Item_cond_and/or (see Item comment)
Item_func(THD *thd, Item_func *item);
bool fix_fields(THD *, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
table_map used_tables() const;
table_map not_null_tables() const;
void update_used_tables();
=== modified file 'sql/item_row.cc'
--- a/sql/item_row.cc 2008-02-22 11:11:25 +0000
+++ b/sql/item_row.cc 2009-07-06 07:57:39 +0000
@@ -124,13 +124,14 @@
}
}
-void Item_row::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_row::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
used_tables_cache= 0;
const_item_cache= 1;
for (uint i= 0; i < arg_count; i++)
{
- items[i]->fix_after_pullout(new_parent, &items[i]);
+ items[i]->fix_after_pullout(new_parent, parent_tables, &items[i]);
used_tables_cache|= items[i]->used_tables();
const_item_cache&= items[i]->const_item();
}
=== modified file 'sql/item_row.h'
--- a/sql/item_row.h 2008-02-22 11:11:25 +0000
+++ b/sql/item_row.h 2009-07-06 07:57:39 +0000
@@ -59,7 +59,8 @@
return 0;
};
bool fix_fields(THD *thd, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
void cleanup();
void split_sum_func(THD *thd, Item **ref_pointer_array, List<Item> &fields);
table_map used_tables() const { return used_tables_cache; };
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-06-30 08:03:05 +0000
+++ b/sql/item_subselect.cc 2009-07-06 07:57:39 +0000
@@ -39,7 +39,7 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1), engine_changed(0), changed(0),
+ const_item_cache(1), inside_fix_fields(0), engine_changed(0), changed(0),
is_correlated(FALSE)
{
with_subselect= 1;
@@ -158,6 +158,13 @@
DBUG_RETURN(RES_OK);
}
+void Item_subselect::set_depth()
+{
+ uint n= 0;
+ for (SELECT_LEX *s= unit->first_select(); s; s= s->outer_select())
+ n++;
+ this->depth= n - 1;
+}
bool Item_subselect::fix_fields(THD *thd_param, Item **ref)
{
@@ -168,9 +175,19 @@
DBUG_ASSERT(fixed == 0);
engine->set_thd((thd= thd_param));
+ if (!inside_fix_fields)
+ {
+ set_depth();
+ if (!(ancestor_used_tables= (table_map*)thd->calloc((1+depth) *
+ sizeof(table_map))))
+ return TRUE;
+ furthest_correlated_ancestor= 0;
+ }
+
if (check_stack_overrun(thd, STACK_MIN_SIZE, (uchar*)&res))
return TRUE;
+ inside_fix_fields++;
res= engine->prepare();
// all transformation is done (used by prepared statements)
@@ -203,12 +220,14 @@
if (!(*ref)->fixed)
ret= (*ref)->fix_fields(thd, ref);
thd->where= save_where;
+ inside_fix_fields--;
return ret;
}
// Is it one field subselect?
if (engine->cols() > max_columns)
{
my_error(ER_OPERAND_COLUMNS, MYF(0), 1);
+ inside_fix_fields--;
return TRUE;
}
fix_length_and_dec();
@@ -225,11 +244,56 @@
fixed= 1;
err:
+ inside_fix_fields--;
thd->where= save_where;
return res;
}
+/*
+ Adjust attributes after our parent select has been merged into grandparent
+
+ DESCRIPTION
+ Subquery is a composite object which may be correlated, that is, it may
+ have
+ 1. references to tables of the parent select (i.e. one that has the clause
+ with the subquery predicate)
+ 2. references to tables of the grandparent select
+ 3. references to tables of further ancestors.
+
+ Before the pullout, this item indicates:
+ - #1 with table bits in used_tables()
+ - #2 and #3 with OUTER_REF_TABLE_BIT.
+
+ After parent has been merged with grandparent:
+ - references to parent and grandparent tables should be indicated with
+ table bits.
+ - references to great-grandparent and further ancestors - with
+ OUTER_REF_TABLE_BIT.
+
+ This is exactly what this function does, based on pre-collected info in
+ ancestor_used_tables and furthest_correlated_ancestor.
+*/
+
+void Item_subselect::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
+{
+ used_tables_cache= (used_tables_cache << parent_tables) |
+ ancestor_used_tables[0];
+ for (uint i=0; i < depth; i++)
+ ancestor_used_tables[i]= ancestor_used_tables[i+1];
+ depth--;
+
+ if (furthest_correlated_ancestor)
+ furthest_correlated_ancestor--;
+ used_tables_cache &= ~OUTER_REF_TABLE_BIT;
+ if (furthest_correlated_ancestor > 1)
+ used_tables_cache |= OUTER_REF_TABLE_BIT;
+ const_item_cache &= test(!(used_tables_cache &
+ ~new_parent->join->const_table_map));
+}
+
+
bool Item_subselect::walk(Item_processor processor, bool walk_subquery,
uchar *argument)
{
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2008-11-10 18:36:50 +0000
+++ b/sql/item_subselect.h 2009-07-06 07:57:39 +0000
@@ -66,9 +66,39 @@
/* work with 'substitution' */
bool have_to_be_excluded;
/* cache of constant state */
+
bool const_item_cache;
+ int inside_fix_fields;
+public:
+ /*
+ Depth of the subquery predicate.
+ If the subquery predicate is attached to some clause of the top-level
+ select, depth will be 1.
+ If it is attached to a clause in a subquery of the top-level select, depth
+ will be 2 and so forth.
+ */
+ uint depth;
+
+ /*
+ Maximum correlation level of the select
+ - select that has no references to outside will have 0,
+ - select that references tables of the select it is located in will have 1,
+ - select that has references to tables of its parent select will have 2,
+ - select that has references to tables of grandparent will have 3
+ and so forth.
+ */
+ uint furthest_correlated_ancestor;
+ /*
+ This is used_tables() for non-direct ancestors. That is,
+ - used_tables() shows which tables of the parent select are referred to
+ from within the subquery,
+ - ancestor_used_tables[0] shows which tables of the grandparent select are
+ referred to from within the subquery,
+ - ancestor_used_tables[1] shows which tables of the great-grandparent
+ select... and so forth.
+ */
+ table_map *ancestor_used_tables;
-public:
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -84,6 +114,7 @@
Item_subselect();
virtual subs_type substype() { return UNKNOWN_SUBS; }
+ void set_depth();
/*
We need this method, because some compilers do not allow 'this'
@@ -109,6 +140,8 @@
return null_value;
}
bool fix_fields(THD *thd, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
virtual bool exec();
virtual void fix_length_and_dec();
table_map used_tables() const;
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2009-06-09 16:53:34 +0000
+++ b/sql/item_sum.cc 2009-07-06 07:57:39 +0000
@@ -350,7 +350,7 @@
sl= sl->master_unit()->outer_select() )
sl->master_unit()->item->with_sum_func= 1;
}
- thd->lex->current_select->mark_as_dependent(aggr_sel);
+ thd->lex->current_select->mark_as_dependent(aggr_sel, NULL);
return FALSE;
}
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2009-06-04 06:27:44 +0000
+++ b/sql/sql_lex.cc 2009-07-06 07:57:39 +0000
@@ -1901,8 +1901,9 @@
'last' should be reachable from this st_select_lex_node
*/
-void st_select_lex::mark_as_dependent(st_select_lex *last)
+void st_select_lex::mark_as_dependent(st_select_lex *last, table_map dep_map)
{
+ uint n_levels= master_unit()->item->depth;
/*
Mark all selects from resolved to 1 before select where was
found table as depended (of select where was found table)
@@ -1928,7 +1929,14 @@
}
Item_subselect *subquery_predicate= s->master_unit()->item;
if (subquery_predicate)
+ {
subquery_predicate->is_correlated= TRUE;
+ subquery_predicate->furthest_correlated_ancestor=
+ max(subquery_predicate->furthest_correlated_ancestor, n_levels);
+ if (n_levels > 1)
+ subquery_predicate->ancestor_used_tables[n_levels - 2]= dep_map;
+ }
+ n_levels--;
}
}
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2009-06-12 02:01:08 +0000
+++ b/sql/sql_lex.h 2009-07-06 07:57:39 +0000
@@ -755,7 +755,7 @@
return master_unit()->return_after_parsing();
}
- void mark_as_dependent(st_select_lex *last);
+ void mark_as_dependent(st_select_lex *last, table_map dep_map);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-07-04 00:44:50 +0000
+++ b/sql/sql_select.cc 2009-07-06 07:57:39 +0000
@@ -3122,16 +3122,23 @@
}
-void fix_list_after_tbl_changes(SELECT_LEX *new_parent, List<TABLE_LIST> *tlist)
+void fix_list_after_tbl_changes(SELECT_LEX *new_parent, uint parent_tables,
+ List<TABLE_LIST> *tlist)
{
List_iterator<TABLE_LIST> it(*tlist);
TABLE_LIST *table;
while ((table= it++))
{
if (table->on_expr)
- table->on_expr->fix_after_pullout(new_parent, &table->on_expr);
+ {
+ table->on_expr->fix_after_pullout(new_parent, parent_tables,
+ &table->on_expr);
+ }
if (table->nested_join)
- fix_list_after_tbl_changes(new_parent, &table->nested_join->join_list);
+ {
+ fix_list_after_tbl_changes(new_parent, parent_tables,
+ &table->nested_join->join_list);
+ }
}
}
@@ -3334,6 +3341,7 @@
/*TODO: also reset the 'with_subselect' there. */
/* n. Adjust the parent_join->tables counter */
+ uint parent_tables= parent_join->tables;
uint table_no= parent_join->tables;
/* n. Walk through child's tables and adjust table->map */
for (tl= subq_lex->leaf_tables; tl; tl= tl->next_leaf, table_no++)
@@ -3410,8 +3418,10 @@
Fix attributes (mainly item->table_map()) for sj-nest's WHERE and ON
expressions.
*/
- sj_nest->sj_on_expr->fix_after_pullout(parent_lex, &sj_nest->sj_on_expr);
- fix_list_after_tbl_changes(parent_lex, &sj_nest->nested_join->join_list);
+ sj_nest->sj_on_expr->fix_after_pullout(parent_lex, parent_join->tables,
+ &sj_nest->sj_on_expr);
+ fix_list_after_tbl_changes(parent_lex, parent_join->tables,
+ &sj_nest->nested_join->join_list);
/* Unlink the child select_lex so it doesn't show up in EXPLAIN: */
[Maria-developers] Rev 2697: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/mysql-6.0-look/
by Sergey Petrunya 04 Jul '09
At file:///home/psergey/dev/mysql-6.0-look/
------------------------------------------------------------
revno: 2697
revision-id: psergey(a)askmonty.org-20090704040131-bzcjcds3siutn6sc
parent: jperkin(a)sun.com-20090423215644-h7ssug9w1hdgzn39
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-6.0-look
timestamp: Sat 2009-07-04 08:01:31 +0400
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2009-03-19 17:03:58 +0000
+++ b/mysql-test/r/subselect_sj.result 2009-07-04 04:01:31 +0000
@@ -327,3 +327,48 @@
HAVING X > '2012-12-12';
X
drop table t1, t2;
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2009-03-19 17:03:58 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2009-07-04 04:01:31 +0000
@@ -331,6 +331,51 @@
HAVING X > '2012-12-12';
X
drop table t1, t2;
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
set join_cache_level=default;
show variables like 'join_cache_level';
Variable_name Value
=== modified file 'mysql-test/t/subselect_sj.test'
--- a/mysql-test/t/subselect_sj.test 2009-03-19 17:03:58 +0000
+++ b/mysql-test/t/subselect_sj.test 2009-07-04 04:01:31 +0000
@@ -216,4 +216,39 @@
HAVING X > '2012-12-12';
drop table t1, t2;
-
+--echo #
+--echo # BUG#31480: Incorrect result for nested subquery when executed via semi join
+--echo #
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+show warnings;
+
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+
+drop table t1, t2, t3;
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2009-04-03 15:14:49 +0000
+++ b/sql/item.cc 2009-07-04 04:01:31 +0000
@@ -2174,7 +2174,8 @@
}
-void Item_field::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_field::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
if (new_parent == depended_from)
depended_from= NULL;
@@ -3559,16 +3560,17 @@
static void mark_as_dependent(THD *thd, SELECT_LEX *last, SELECT_LEX *current,
Item_ident *resolved_item,
- Item_ident *mark_item)
+ Item_ident *mark_item, table_map dep_map)
{
const char *db_name= (resolved_item->db_name ?
resolved_item->db_name : "");
const char *table_name= (resolved_item->table_name ?
resolved_item->table_name : "");
+ //table_map dep_map = resolved_item->used_tables();
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
- current->mark_as_dependent(last);
+ current->mark_as_dependent(last, dep_map);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
char warn_buff[MYSQL_ERRMSG_SIZE];
@@ -3628,21 +3630,26 @@
Item_subselect *prev_subselect_item=
previous_select->master_unit()->item;
Item_ident *dependent= resolved_item;
+ table_map found_used_tables;
if (found_field == view_ref_found)
{
Item::Type type= found_item->type();
+ found_used_tables= found_item->used_tables();
prev_subselect_item->used_tables_cache|=
- found_item->used_tables();
+ found_used_tables;
dependent= ((type == Item::REF_ITEM || type == Item::FIELD_ITEM) ?
(Item_ident*) found_item :
0);
}
else
+ {
+ found_used_tables= found_field->table->map;
prev_subselect_item->used_tables_cache|=
found_field->table->map;
+ }
prev_subselect_item->const_item_cache= 0;
mark_as_dependent(thd, last_select, current_sel, resolved_item,
- dependent);
+ dependent, found_used_tables);
}
}
@@ -3923,6 +3930,7 @@
SELECT_LEX *current_sel= (SELECT_LEX *) thd->lex->current_select;
Name_resolution_context *outer_context= 0;
SELECT_LEX *select= 0;
+ uint n_levels= 0;
/* Currently derived tables cannot be correlated */
if (current_sel->master_unit()->first_select()->linkage !=
DERIVED_TABLE_TYPE)
@@ -4015,7 +4023,8 @@
context->select_lex, this,
((ref_type == REF_ITEM ||
ref_type == FIELD_ITEM) ?
- (Item_ident*) (*reference) : 0));
+ (Item_ident*) (*reference) : 0),
+ (*from_field)->table->map);
return 0;
}
}
@@ -4030,7 +4039,8 @@
context->select_lex, this,
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
- 0));
+ 0),
+ (*reference)->used_tables());
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4064,6 +4074,7 @@
*/
prev_subselect_item->used_tables_cache|= OUTER_REF_TABLE_BIT;
prev_subselect_item->const_item_cache= 0;
+ n_levels++;
}
DBUG_ASSERT(ref != 0);
@@ -4131,14 +4142,15 @@
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex, this,
- rf);
+ rf, rf->used_tables());
return 0;
}
else
{
mark_as_dependent(thd, last_checked_context->select_lex,
context->select_lex,
- this, (Item_ident*)*reference);
+ this, (Item_ident*)*reference,
+ (*reference)->used_tables());
if (last_checked_context->select_lex->having_fix_field)
{
Item_ref *rf;
@@ -5840,7 +5852,8 @@
((refer_type == REF_ITEM ||
refer_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
- 0));
+ 0),
+ (*reference)->used_tables());
/*
view reference found, we substituted it instead of this
Item, so can quit
@@ -5890,7 +5903,8 @@
goto error;
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
- thd->lex->current_select, this, fld);
+ thd->lex->current_select, this, fld,
+ from_field->table->map);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -5913,7 +5927,8 @@
/* Should be checked in resolve_ref_in_select_and_group(). */
DBUG_ASSERT(*ref && (*ref)->fixed);
mark_as_dependent(thd, last_checked_context->select_lex,
- context->select_lex, this, this);
+ context->select_lex, this, this,
+ (*ref)->used_tables());
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6323,20 +6338,22 @@
return err;
}
-void Item_outer_ref::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_outer_ref::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
if (depended_from == new_parent)
{
*ref= outer_ref;
- outer_ref->fix_after_pullout(new_parent, ref);
+ outer_ref->fix_after_pullout(new_parent, parent_tables, ref);
}
}
-void Item_ref::fix_after_pullout(st_select_lex *new_parent, Item **refptr)
+void Item_ref::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **refptr)
{
if (depended_from == new_parent)
{
- (*ref)->fix_after_pullout(new_parent, ref);
+ (*ref)->fix_after_pullout(new_parent, parent_tables, ref);
depended_from= NULL;
}
}
=== modified file 'sql/item.h'
--- a/sql/item.h 2009-04-03 15:14:49 +0000
+++ b/sql/item.h 2009-07-04 04:01:31 +0000
@@ -557,7 +557,8 @@
Fix after some tables has been pulled out. Basically re-calculate all
attributes that are dependent on the tables.
*/
- virtual void fix_after_pullout(st_select_lex *new_parent, Item **ref) {};
+ virtual void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref) {};
/*
should be used in case where we are sure that we do not need
@@ -1486,7 +1487,8 @@
bool send(Protocol *protocol, String *str_arg);
void reset_field(Field *f);
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
void make_field(Send_field *tmp_field);
int save_in_field(Field *field,bool no_conversions);
void save_org_in_field(Field *field);
@@ -2278,7 +2280,8 @@
bool send(Protocol *prot, String *tmp);
void make_field(Send_field *field);
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
int save_in_field(Field *field, bool no_conversions);
void save_org_in_field(Field *field);
enum Item_result result_type () const { return (*ref)->result_type(); }
@@ -2448,7 +2451,8 @@
outer_ref->save_org_in_field(result_field);
}
bool fix_fields(THD *, Item **);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
table_map used_tables() const
{
return (*ref)->const_item() ? 0 : OUTER_REF_TABLE_BIT;
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2009-04-01 21:36:07 +0000
+++ b/sql/item_cmpfunc.cc 2009-07-04 04:01:31 +0000
@@ -4013,7 +4013,8 @@
}
-void Item_cond::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_cond::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
List_iterator<Item> li(list);
Item *item;
@@ -4027,7 +4028,7 @@
while ((item=li++))
{
table_map tmp_table_map;
- item->fix_after_pullout(new_parent, li.ref());
+ item->fix_after_pullout(new_parent, parent_tables, li.ref());
item= *li.ref();
used_tables_cache|= item->used_tables();
const_item_cache&= item->const_item();
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2009-01-26 16:03:39 +0000
+++ b/sql/item_cmpfunc.h 2009-07-04 04:01:31 +0000
@@ -1475,7 +1475,8 @@
bool add_at_head(Item *item) { return list.push_front(item); }
void add_at_head(List<Item> *nlist) { list.prepand(nlist); }
bool fix_fields(THD *, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
enum Type type() const { return COND_ITEM; }
List<Item>* argument_list() { return &list; }
=== modified file 'sql/item_func.cc'
--- a/sql/item_func.cc 2009-04-13 13:24:28 +0000
+++ b/sql/item_func.cc 2009-07-04 04:01:31 +0000
@@ -206,7 +206,8 @@
}
-void Item_func::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_func::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
Item **arg,**arg_end;
@@ -217,7 +218,7 @@
{
for (arg=args, arg_end=args+arg_count; arg != arg_end ; arg++)
{
- (*arg)->fix_after_pullout(new_parent, arg);
+ (*arg)->fix_after_pullout(new_parent, parent_tables, arg);
Item *item= *arg;
used_tables_cache|= item->used_tables();
=== modified file 'sql/item_func.h'
--- a/sql/item_func.h 2009-02-13 16:30:54 +0000
+++ b/sql/item_func.h 2009-07-04 04:01:31 +0000
@@ -117,7 +117,8 @@
// Constructor used for Item_cond_and/or (see Item comment)
Item_func(THD *thd, Item_func *item);
bool fix_fields(THD *, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
table_map used_tables() const;
table_map not_null_tables() const;
void update_used_tables();
=== modified file 'sql/item_row.cc'
--- a/sql/item_row.cc 2008-02-22 11:11:25 +0000
+++ b/sql/item_row.cc 2009-07-04 04:01:31 +0000
@@ -124,13 +124,14 @@
}
}
-void Item_row::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+void Item_row::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
{
used_tables_cache= 0;
const_item_cache= 1;
for (uint i= 0; i < arg_count; i++)
{
- items[i]->fix_after_pullout(new_parent, &items[i]);
+ items[i]->fix_after_pullout(new_parent, parent_tables, &items[i]);
used_tables_cache|= items[i]->used_tables();
const_item_cache&= items[i]->const_item();
}
=== modified file 'sql/item_row.h'
--- a/sql/item_row.h 2008-02-22 11:11:25 +0000
+++ b/sql/item_row.h 2009-07-04 04:01:31 +0000
@@ -59,7 +59,8 @@
return 0;
};
bool fix_fields(THD *thd, Item **ref);
- void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
void cleanup();
void split_sum_func(THD *thd, Item **ref_pointer_array, List<Item> &fields);
table_map used_tables() const { return used_tables_cache; };
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-01-08 19:06:44 +0000
+++ b/sql/item_subselect.cc 2009-07-04 04:01:31 +0000
@@ -39,7 +39,7 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1), engine_changed(0), changed(0),
+ const_item_cache(1), inside_fix_fields(0), engine_changed(0), changed(0),
is_correlated(FALSE)
{
with_subselect= 1;
@@ -158,6 +158,13 @@
DBUG_RETURN(RES_OK);
}
+void Item_subselect::set_depth()
+{
+ uint n= 0;
+ for (SELECT_LEX *s= unit->first_select(); s; s= s->outer_select())
+ n++;
+ this->depth= n - 1;
+}
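The new `set_depth()` simply counts the SELECTs from the subquery's own select up to the top level and subtracts one, so a predicate attached directly to the top-level select gets depth 1. A minimal sketch over a hypothetical chain of outer selects (the `SelectLex` struct is illustrative, not the server's `st_select_lex`):

```cpp
#include <cassert>

// Illustrative stand-in: each select points at its outer_select(),
// null for the top-level SELECT.
struct SelectLex {
  SelectLex *outer;
};

// Depth = number of SELECTs strictly above the subquery's own select.
unsigned subquery_depth(SelectLex *first_select) {
  unsigned n = 0;
  for (SelectLex *s = first_select; s; s = s->outer)
    n++;
  return n - 1; // the subquery's own select does not count
}
```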
bool Item_subselect::fix_fields(THD *thd_param, Item **ref)
{
@@ -168,9 +175,19 @@
DBUG_ASSERT(fixed == 0);
engine->set_thd((thd= thd_param));
+ if (!inside_fix_fields)
+ {
+ set_depth();
+ if (!(ancestor_used_tables= (table_map*)thd->calloc((1+depth) *
+ sizeof(table_map))))
+ return TRUE;
+ furthest_correlated_ancestor= 0;
+ }
+
if (check_stack_overrun(thd, STACK_MIN_SIZE, (uchar*)&res))
return TRUE;
+ inside_fix_fields++;
res= engine->prepare();
// all transformation is done (used by prepared statements)
@@ -203,12 +220,14 @@
if (!(*ref)->fixed)
ret= (*ref)->fix_fields(thd, ref);
thd->where= save_where;
+ inside_fix_fields--;
return ret;
}
// Is it one field subselect?
if (engine->cols() > max_columns)
{
my_error(ER_OPERAND_COLUMNS, MYF(0), 1);
+ inside_fix_fields--;
return TRUE;
}
fix_length_and_dec();
@@ -225,11 +244,56 @@
fixed= 1;
err:
+ inside_fix_fields--;
thd->where= save_where;
return res;
}
+/*
+ Adjust attributes after our parent select has been merged into grandparent
+
+ DESCRIPTION
+ Subquery is a composite object which may be correlated, that is, it may
+ have
+ 1. references to tables of the parent select (i.e. one that has the clause
+ with the subquery predicate)
+ 2. references to tables of the grandparent select
+ 3. references to tables of further ancestors.
+
+ Before the pullout, this item indicates:
+ - #1 with table bits in used_tables()
+ - #2 and #3 with OUTER_REF_TABLE_BIT.
+
+ After parent has been merged with grandparent:
+ - references to parent and grandparent tables should be indicated with
+ table bits.
+ - references to greatgrandparent and further ancestors - with
+ OUTER_REF_TABLE_BIT.
+
+ This is exactly what this function does, based on pre-collected info in
+ ancestor_used_tables and furthest_correlated_ancestor.
+*/
+
+void Item_subselect::fix_after_pullout(st_select_lex *new_parent,
+ uint parent_tables, Item **ref)
+{
+ used_tables_cache= (used_tables_cache << parent_tables) |
+ ancestor_used_tables[0];
+ for (uint i=0; i < depth; i++)
+ ancestor_used_tables[i]= ancestor_used_tables[i+1];
+ depth--;
+
+ if (furthest_correlated_ancestor)
+ furthest_correlated_ancestor--;
+ used_tables_cache &= ~OUTER_REF_TABLE_BIT;
+ if (furthest_correlated_ancestor > 1)
+ used_tables_cache |= OUTER_REF_TABLE_BIT;
+ const_item_cache &= test(!(used_tables_cache &
+ ~new_parent->join->const_table_map));
+}
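The map arithmetic in `fix_after_pullout()` above can be exercised in isolation: after the merge, the old parent's tables are renumbered to sit after the grandparent's `parent_tables` tables, so the cached map is shifted left and merged with the bits recorded for the grandparent, and every ancestor level moves one step closer. A sketch with the members inlined into a plain struct (member names match the patch; the surrounding scaffolding and the `OUTER_REF_TABLE_BIT` value are illustrative):

```cpp
#include <cassert>
#include <cstdint>

using table_map = uint64_t;
// Illustrative value; the server reserves a dedicated high bit for this.
const table_map OUTER_REF_TABLE_BIT = 1ULL << 62;

struct SubselectMaps {
  table_map used_tables_cache;       // refs to tables of the (old) parent
  table_map ancestor_used_tables[4]; // [0] = grandparent, [1] = great-gp...
  unsigned depth;
  unsigned furthest_correlated_ancestor;

  void fix_after_pullout(unsigned parent_tables) {
    // Old-parent tables now follow the grandparent's tables, so shift our
    // map up and merge in the bits recorded for the grandparent, which has
    // become the direct parent.
    used_tables_cache = (used_tables_cache << parent_tables) |
                        ancestor_used_tables[0];
    for (unsigned i = 0; i + 1 < depth; i++)
      ancestor_used_tables[i] = ancestor_used_tables[i + 1];
    depth--;
    if (furthest_correlated_ancestor)
      furthest_correlated_ancestor--;
    // Anything beyond the new parent remains an outer reference.
    used_tables_cache &= ~OUTER_REF_TABLE_BIT;
    if (furthest_correlated_ancestor > 1)
      used_tables_cache |= OUTER_REF_TABLE_BIT;
  }
};
```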
+
+
bool Item_subselect::walk(Item_processor processor, bool walk_subquery,
uchar *argument)
{
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2008-11-10 18:36:50 +0000
+++ b/sql/item_subselect.h 2009-07-04 04:01:31 +0000
@@ -66,9 +66,39 @@
/* work with 'substitution' */
bool have_to_be_excluded;
/* cache of constant state */
+
bool const_item_cache;
+ int inside_fix_fields;
+public:
+ /*
+ Depth of the subquery predicate.
+ If the subquery predicate is attatched to some clause of the top-level
+ select, depth will be 1
+ If it is attached to a clause in a subquery of the top-level select, depth
+ will be 2 and so forth.
+ */
+ uint depth;
+
+ /*
+ Maximum correlation level of the select
+ - select that has no references to outside will have 0,
+ - select that references tables in the select it is located will have 1,
+ - select that has references to tables of its parent select will have 2,
+ - select that has references to tables of grandparent will have 3
+ and so forth.
+ */
+ uint furthest_correlated_ancestor;
+ /*
+ This is used_tables() for non-direct ancestors. That is,
+ - used_tables() shows which tables of the parent select are referred to
+ from within the subquery,
+ - ancestor_used_tables[0] shows which tables of the grandparent select are
+ referred to from within the subquery,
+ - ancestor_used_tables[1] shows which tables of the great grand parent
+ select... and so forth.
+ */
+ table_map *ancestor_used_tables;
-public:
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -84,6 +114,7 @@
Item_subselect();
virtual subs_type substype() { return UNKNOWN_SUBS; }
+ void set_depth();
/*
We need this method, because some compilers do not allow 'this'
@@ -109,6 +140,8 @@
return null_value;
}
bool fix_fields(THD *thd, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, uint parent_tables,
+ Item **ref);
virtual bool exec();
virtual void fix_length_and_dec();
table_map used_tables() const;
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2009-03-11 12:52:04 +0000
+++ b/sql/item_sum.cc 2009-07-04 04:01:31 +0000
@@ -350,7 +350,7 @@
sl= sl->master_unit()->outer_select() )
sl->master_unit()->item->with_sum_func= 1;
}
- thd->lex->current_select->mark_as_dependent(aggr_sel);
+ thd->lex->current_select->mark_as_dependent(aggr_sel, NULL);
return FALSE;
}
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2009-04-01 09:34:34 +0000
+++ b/sql/sql_lex.cc 2009-07-04 04:01:31 +0000
@@ -1835,8 +1835,9 @@
'last' should be reachable from this st_select_lex_node
*/
-void st_select_lex::mark_as_dependent(st_select_lex *last)
+void st_select_lex::mark_as_dependent(st_select_lex *last, table_map dep_map)
{
+ uint n_levels= master_unit()->item->depth;
/*
Mark all selects from resolved to 1 before select where was
found table as depended (of select where was found table)
@@ -1862,7 +1863,14 @@
}
Item_subselect *subquery_predicate= s->master_unit()->item;
if (subquery_predicate)
+ {
subquery_predicate->is_correlated= TRUE;
+ subquery_predicate->furthest_correlated_ancestor=
+ max(subquery_predicate->furthest_correlated_ancestor, n_levels);
+ if (n_levels > 1)
+ subquery_predicate->ancestor_used_tables[n_levels - 2]= dep_map;
+ }
+ n_levels--;
}
}
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2009-03-19 16:42:23 +0000
+++ b/sql/sql_lex.h 2009-07-04 04:01:31 +0000
@@ -754,7 +754,7 @@
return master_unit()->return_after_parsing();
}
- void mark_as_dependent(st_select_lex *last);
+ void mark_as_dependent(st_select_lex *last, table_map dep_map);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-04-13 09:51:24 +0000
+++ b/sql/sql_select.cc 2009-07-04 04:01:31 +0000
@@ -3119,16 +3119,23 @@
}
-void fix_list_after_tbl_changes(SELECT_LEX *new_parent, List<TABLE_LIST> *tlist)
+void fix_list_after_tbl_changes(SELECT_LEX *new_parent, uint parent_tables,
+ List<TABLE_LIST> *tlist)
{
List_iterator<TABLE_LIST> it(*tlist);
TABLE_LIST *table;
while ((table= it++))
{
if (table->on_expr)
- table->on_expr->fix_after_pullout(new_parent, &table->on_expr);
+ {
+ table->on_expr->fix_after_pullout(new_parent, parent_tables,
+ &table->on_expr);
+ }
if (table->nested_join)
- fix_list_after_tbl_changes(new_parent, &table->nested_join->join_list);
+ {
+ fix_list_after_tbl_changes(new_parent, parent_tables,
+ &table->nested_join->join_list);
+ }
}
}
@@ -3331,6 +3338,7 @@
/*TODO: also reset the 'with_subselect' there. */
/* n. Adjust the parent_join->tables counter */
+ uint parent_tables= parent_join->tables;
uint table_no= parent_join->tables;
/* n. Walk through child's tables and adjust table->map */
for (tl= subq_lex->leaf_tables; tl; tl= tl->next_leaf, table_no++)
@@ -3407,8 +3415,10 @@
Walk through sj nest's WHERE and ON expressions and call
item->fix_table_changes() for all items.
*/
- sj_nest->sj_on_expr->fix_after_pullout(parent_lex, &sj_nest->sj_on_expr);
- fix_list_after_tbl_changes(parent_lex, &sj_nest->nested_join->join_list);
+ sj_nest->sj_on_expr->fix_after_pullout(parent_lex, parent_join->tables,
+ &sj_nest->sj_on_expr);
+ fix_list_after_tbl_changes(parent_lex, parent_join->tables,
+ &sj_nest->nested_join->join_list);
/* Unlink the child select_lex so it doesn't show up in EXPLAIN: */
[Maria-developers] Rev 2814: Better comments in file:///home/psergey/dev/mysql-next/
by Sergey Petrunya 04 Jul '09
At file:///home/psergey/dev/mysql-next/
------------------------------------------------------------
revno: 2814
revision-id: psergey(a)askmonty.org-20090704004450-4pqbx9pm50bzky0l
parent: alik(a)sun.com-20090702085822-8svd0aslr7qnddbb
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: mysql-next
timestamp: Sat 2009-07-04 04:44:50 +0400
message:
Better comments
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-06-30 08:03:05 +0000
+++ b/sql/sql_select.cc 2009-07-04 00:44:50 +0000
@@ -3407,8 +3407,8 @@
sj_nest->sj_on_expr->fix_fields(parent_join->thd, &sj_nest->sj_on_expr);
/*
- Walk through sj nest's WHERE and ON expressions and call
- item->fix_table_changes() for all items.
+ Fix attributes (mainly item->table_map()) for sj-nest's WHERE and ON
+ expressions.
*/
sj_nest->sj_on_expr->fix_after_pullout(parent_lex, &sj_nest->sj_on_expr);
fix_list_after_tbl_changes(parent_lex, &sj_nest->nested_join->join_list);
1
0

[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2715: Added MY_CS_NONASCII marker for character sets that are not compatible with latin1 for characters...
by noreply@launchpad.net 02 Jul '09
------------------------------------------------------------
revno: 2715
committer: Michael Widenius <monty(a)askmonty.org>
branch nick: mysql-maria
timestamp: Thu 2009-07-02 13:15:33 +0300
message:
Added MY_CS_NONASCII marker for character sets that are not compatible with latin1 for characters 0x00-0x7f.
This allows us to skip, and thereby speed up, some very common character conversions that MySQL performs when sending data to the client,
which gives a nice speed increase for most queries that use only characters in the range 0x00-0x7f.
This code is based on Alexander Barkov's work in MySQL 6.0.
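The marker boils down to one question: do bytes 0x00-0x7f map to the same code points as ASCII? If yes, conversion to and from latin1 can be skipped for those bytes. A minimal sketch of such a check over a byte-to-Unicode mapping table (illustrative only; the server's version is `my_charset_is_ascii_compatible()` operating on a `CHARSET_INFO`):

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the low 128 entries of a charset's byte-to-Unicode
// table are the identity mapping, i.e. the charset is ASCII-compatible.
bool ascii_compatible(const uint16_t to_unicode[256]) {
  for (int c = 0; c < 128; c++)
    if (to_unicode[c] != (uint16_t)c)
      return false;
  return true;
}
```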
modified:
include/m_ctype.h
libmysqld/lib_sql.cc
mysys/charset.c
scripts/mysql_install_db.sh
sql/protocol.cc
sql/protocol.h
sql/sql_string.cc
strings/conf_to_src.c
strings/ctype-extra.c
strings/ctype-sjis.c
strings/ctype-uca.c
strings/ctype-ucs2.c
strings/ctype-utf8.c
strings/ctype.c
=== modified file 'include/m_ctype.h'
--- include/m_ctype.h 2008-12-23 14:21:01 +0000
+++ include/m_ctype.h 2009-07-02 10:15:33 +0000
@@ -87,6 +87,7 @@
#define MY_CS_CSSORT 1024 /* if case sensitive sort order */
#define MY_CS_HIDDEN 2048 /* don't display in SHOW */
#define MY_CS_PUREASCII 4096 /* if a charset is pure ascii */
+#define MY_CS_NONASCII 8192 /* if not ASCII-compatible */
#define MY_CHARSET_UNDEFINED 0
/* Character repertoire flags */
@@ -517,6 +518,7 @@
#define my_strcasecmp(s, a, b) ((s)->coll->strcasecmp((s), (a), (b)))
#define my_charpos(cs, b, e, num) (cs)->cset->charpos((cs), (const char*) (b), (const char *)(e), (num))
+my_bool my_charset_is_ascii_compatible(CHARSET_INFO *cs);
#define use_mb(s) ((s)->cset->ismbchar != NULL)
#define my_ismbchar(s, a, b) ((s)->cset->ismbchar((s), (a), (b)))
=== modified file 'libmysqld/lib_sql.cc'
--- libmysqld/lib_sql.cc 2009-02-24 11:29:49 +0000
+++ libmysqld/lib_sql.cc 2009-07-02 10:15:33 +0000
@@ -1124,6 +1124,7 @@
return false;
}
+
bool Protocol::net_store_data(const uchar *from, size_t length)
{
char *field_buf;
@@ -1143,6 +1144,30 @@
return FALSE;
}
+
+bool Protocol::net_store_data(const uchar *from, size_t length,
+ CHARSET_INFO *from_cs, CHARSET_INFO *to_cs)
+{
+ uint conv_length= to_cs->mbmaxlen * length / from_cs->mbminlen;
+ uint dummy_error;
+ char *field_buf;
+ if (!thd->mysql) // bootstrap file handling
+ return false;
+
+ if (!(field_buf= (char*) alloc_root(alloc, conv_length + sizeof(uint) + 1)))
+ return true;
+ *next_field= field_buf + sizeof(uint);
+ length= copy_and_convert(*next_field, conv_length, to_cs,
+ (const char*) from, length, from_cs, &dummy_error);
+ *(uint *) field_buf= length;
+ (*next_field)[length]= 0;
+ if (next_mysql_field->max_length < length)
+ next_mysql_field->max_length= length;
+ ++next_field;
+ ++next_mysql_field;
+ return false;
+}
+
#if defined(_MSC_VER) && _MSC_VER < 1400
#define vsnprintf _vsnprintf
#endif
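The new converting `net_store_data()` above sizes its buffer for the worst case: each input character may occupy as little as `from_cs->mbminlen` bytes yet expand to as much as `to_cs->mbmaxlen` bytes on output. That sizing rule in isolation (a sketch; the parameter names mirror the `CHARSET_INFO` fields used in the patch):

```cpp
#include <cassert>
#include <cstddef>

// Worst-case output size when converting `in_bytes` bytes between two
// character sets: every character shrinks to from_mbminlen bytes on input
// and grows to to_mbmaxlen bytes on output.
size_t conv_buffer_size(size_t in_bytes, unsigned from_mbminlen,
                        unsigned to_mbmaxlen) {
  return (size_t)to_mbmaxlen * in_bytes / from_mbminlen;
}
```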
=== modified file 'mysys/charset.c'
--- mysys/charset.c 2009-02-13 16:41:47 +0000
+++ mysys/charset.c 2009-07-02 10:15:33 +0000
@@ -248,6 +248,7 @@
{
#if defined(HAVE_CHARSET_ucs2) && defined(HAVE_UCA_COLLATIONS)
copy_uca_collation(newcs, &my_charset_ucs2_unicode_ci);
+ newcs->state|= MY_CS_AVAILABLE | MY_CS_LOADED | MY_CS_NONASCII;
#endif
}
else if (!strcmp(cs->csname, "utf8"))
@@ -280,6 +281,8 @@
if (my_charset_is_8bit_pure_ascii(all_charsets[cs->number]))
all_charsets[cs->number]->state|= MY_CS_PUREASCII;
+ if (!my_charset_is_ascii_compatible(cs))
+ all_charsets[cs->number]->state|= MY_CS_NONASCII;
}
}
else
=== modified file 'scripts/mysql_install_db.sh'
--- scripts/mysql_install_db.sh 2009-01-06 15:08:15 +0000
+++ scripts/mysql_install_db.sh 2009-07-02 10:15:33 +0000
@@ -1,5 +1,5 @@
#!/bin/sh
-# Copyright (C) 2002-2003 MySQL AB
+# Copyright (C) 2002-2003 MySQL AB & Monty Program Ab
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -14,7 +14,7 @@
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
-# This scripts creates the MySQL Server system tables
+# This script creates the MariaDB Server system tables
#
# All unrecognized arguments to this script are passed to mysqld.
@@ -38,26 +38,27 @@
{
cat <<EOF
Usage: $0 [OPTIONS]
- --basedir=path The path to the MySQL installation directory.
+ --basedir=path The path to the MariaDB installation directory.
--builddir=path If using --srcdir with out-of-directory builds, you
will need to set this to the location of the build
directory where built files reside.
- --cross-bootstrap For internal use. Used when building the MySQL system
+ --cross-bootstrap For internal use. Used when building the MariaDB system
tables on a different host than the target.
- --datadir=path The path to the MySQL data directory.
+ --datadir=path The path to the MariaDB data directory.
--force Causes mysql_install_db to run even if DNS does not
work. In that case, grant table entries that normally
use hostnames will use IP addresses.
- --ldata=path The path to the MySQL data directory. Same as --datadir.
+ --ldata=path The path to the MariaDB data directory. Same as
+ --datadir.
--rpm For internal use. This option is used by RPM files
- during the MySQL installation process.
+ during the MariaDB installation process.
--skip-name-resolve Use IP addresses rather than hostnames when creating
grant table entries. This option can be useful if
your DNS does not work.
- --srcdir=path The path to the MySQL source directory. This option
+ --srcdir=path The path to the MariaDB source directory. This option
uses the compiled binaries and support files within the
source tree, useful for if you don't want to install
- MySQL yet and just want to create the system tables.
+ MariaDB yet and just want to create the system tables.
--user=user_name The login username to use for running mysqld. Files
and directories created by mysqld will be owned by this
user. You must be root to use this option. By default
@@ -116,7 +117,7 @@
defaults="$arg" ;;
--cross-bootstrap|--windows)
- # Used when building the MySQL system tables on a different host than
+ # Used when building the MariaDB system tables on a different host than
# the target. The platform-independent files that are created in
# --datadir on the host can be copied to the target system.
#
@@ -338,10 +339,10 @@
fi
echo "WARNING: The host '$hostname' could not be looked up with resolveip."
echo "This probably means that your libc libraries are not 100 % compatible"
- echo "with this binary MySQL version. The MySQL daemon, mysqld, should work"
+ echo "with this binary MariaDB version. The MariaDB daemon, mysqld, should work"
echo "normally with the exception that host name resolving will not work."
echo "This means that you should use IP addresses instead of hostnames"
- echo "when specifying MySQL privileges !"
+ echo "when specifying MariaDB privileges !"
fi
fi
@@ -388,7 +389,7 @@
--net_buffer_length=16K"
# Create the system and help tables by passing them to "mysqld --bootstrap"
-s_echo "Installing MySQL system tables..."
+s_echo "Installing MariaDB/MySQL system tables..."
if { echo "use mysql;"; cat $create_system_tables $fill_system_tables; } | eval "$filter_cmd_line" | $mysqld_install_cmd_line > /dev/null
then
s_echo "OK"
@@ -410,14 +411,16 @@
echo "Try 'mysqld --help' if you have problems with paths. Using --log"
echo "gives you a log in $ldata that may be helpful."
echo
- echo "The latest information about MySQL is available on the web at"
- echo "http://www.mysql.com/. Please consult the MySQL manual section"
+ echo "The latest information about MariaDB is available on the web at"
+ echo "http://askmonty.org/wiki/index.php/MariaDB".
+ echo "If you have a problem, you can consult the MySQL manual section"
echo "'Problems running mysql_install_db', and the manual section that"
- echo "describes problems on your OS. Another information source are the"
- echo "MySQL email archives available at http://lists.mysql.com/."
+ echo "describes problems on your OS at http://dev.mysql.com/doc/"
+ echo "MariaDB is hosted on Launchpad; you can find the latest source and"
+ echo "email lists at http://launchpad.net/maria"
echo
echo "Please check all of the above before mailing us! And remember, if"
- echo "you do mail us, you MUST use the $scriptdir/mysqlbug script!"
+ echo "you do mail us, you should use the $scriptdir/mysqlbug script!"
echo
exit 1
fi
@@ -442,7 +445,7 @@
s_echo "support-files/mysql.server to the right place for your system"
echo
- echo "PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !"
+ echo "PLEASE REMEMBER TO SET A PASSWORD FOR THE MariaDB root USER !"
echo "To do so, start the server, then issue the following commands:"
echo
echo "$bindir/mysqladmin -u root password 'new-password'"
@@ -455,23 +458,28 @@
echo "databases and anonymous user created by default. This is"
echo "strongly recommended for production servers."
echo
- echo "See the manual for more instructions."
+ echo "See the MySQL manual for more instructions."
if test "$in_rpm" -eq 0
then
echo
- echo "You can start the MySQL daemon with:"
+ echo "You can start the MariaDB daemon with:"
echo "cd $basedir ; $bindir/mysqld_safe &"
echo
- echo "You can test the MySQL daemon with mysql-test-run.pl"
+ echo "You can test the MariaDB daemon with mysql-test-run.pl"
echo "cd $basedir/mysql-test ; perl mysql-test-run.pl"
fi
echo
echo "Please report any problems with the $scriptdir/mysqlbug script!"
echo
- echo "The latest information about MySQL is available at http://www.mysql.com/"
- echo "Support MySQL by buying support/licenses from http://shop.mysql.com/"
+ echo "The latest information about MariaDB is available at http://www.askmonty.org/."
+ echo "You can find additional information about the MySQL part at:"
+ echo "http://dev.mysql.com"
+ echo "Support MariaDB development by buying support/new features from"
+ echo "Monty Program Ab. You can contact us about this at sales(a)askmonty.org".
+ echo "Alternatively consider joining our community based development effort:"
+ echo "http://askmonty.org/wiki/index.php/MariaDB#How_can_I_participate_in_the_dev…"
echo
fi
=== modified file 'sql/protocol.cc'
--- sql/protocol.cc 2009-04-25 10:05:32 +0000
+++ sql/protocol.cc 2009-07-02 10:15:33 +0000
@@ -58,6 +58,65 @@
}
+/*
+ net_store_data() - extended version with character set conversion.
+
+ It is optimized for short strings whose length after
+ conversion is guaranteed to be less than 251, which occupies
+ exactly one byte to store the length. This avoids using
+ the "convert" member as a temporary buffer; the conversion
+ is done directly into the "packet" member.
+ The limit of 251 is good enough to optimize send_fields(),
+ because column, table and database names fit within this limit.
+*/
+
+#ifndef EMBEDDED_LIBRARY
+bool Protocol::net_store_data(const uchar *from, size_t length,
+ CHARSET_INFO *from_cs, CHARSET_INFO *to_cs)
+{
+ uint dummy_errors;
+ /* Calculate maximum possible result length */
+ size_t conv_length= to_cs->mbmaxlen * length / from_cs->mbminlen;
+ ulong packet_length, new_length;
+ char *length_pos, *to;
+
+ if (conv_length > 250)
+ {
+ /*
+ For strings with conv_length greater than 250 bytes
+ we don't know in advance how many bytes we will need to store the
+ length (one or two), because the result length is unknown until
+ the conversion is done.
+ For example, when converting from utf8 (mbmaxlen=3) to latin1,
+ conv_length=300 means that the result length can vary between 100 and 300:
+ length=100 needs one byte, length=300 needs two bytes.
+
+ Thus converting directly into "packet" is not worthwhile.
+ Let's use "convert" as a temporary buffer.
+ */
+ return (convert->copy((const char*) from, length, from_cs, to_cs,
+ &dummy_errors) ||
+ net_store_data((const uchar*) convert->ptr(), convert->length()));
+ }
+
+ packet_length= packet->length();
+ new_length= packet_length + conv_length + 1;
+
+ if (new_length > packet->alloced_length() && packet->realloc(new_length))
+ return 1;
+
+ length_pos= (char*) packet->ptr() + packet_length;
+ to= length_pos + 1;
+
+ to+= copy_and_convert(to, conv_length, to_cs,
+ (const char*) from, length, from_cs, &dummy_errors);
+
+ net_store_length((uchar*) length_pos, to - length_pos - 1);
+ packet->length((uint) (to - packet->ptr()));
+ return 0;
+}
+#endif
+
+
/**
Send a error string to client.
@@ -773,10 +832,10 @@
fromcs != &my_charset_bin &&
tocs != &my_charset_bin)
{
- uint dummy_errors;
- return (convert->copy(from, length, fromcs, tocs, &dummy_errors) ||
- net_store_data((uchar*) convert->ptr(), convert->length()));
+ /* Store with conversion */
+ return net_store_data((uchar*) from, length, fromcs, tocs);
}
+ /* Store without conversion */
return net_store_data((uchar*) from, length);
}
@@ -802,7 +861,7 @@
{
CHARSET_INFO *tocs= this->thd->variables.character_set_results;
#ifndef DBUG_OFF
- DBUG_PRINT("info", ("Protocol_text::store field %u (%u): %*s", field_pos,
+ DBUG_PRINT("info", ("Protocol_text::store field %u (%u): %.*s", field_pos,
field_count, (int) length, from));
DBUG_ASSERT(field_pos < field_count);
DBUG_ASSERT(field_types == 0 ||
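The 251 cutoff in the protocol.cc comment above comes from the client/server protocol's length-encoded integers: values below 251 occupy a single byte, while larger values need a marker byte plus extra bytes. A simplified sketch of the one- and two-byte forms (the real net_store_length() also handles the 3- and 8-byte 0xFD/0xFE forms, omitted here):

```c
#include <stdint.h>
#include <stddef.h>

/* Store `n` as a protocol length-encoded integer into `to`; return
   the number of bytes written. Values < 251 take exactly one byte,
   which is why conv_length <= 250 lets the caller reserve a single
   length byte up front. */
size_t store_length_sketch(unsigned char *to, uint64_t n)
{
    if (n < 251) {
        to[0] = (unsigned char) n;          /* one-byte form */
        return 1;
    }
    if (n < 65536) {
        to[0] = 0xFC;                       /* two-byte form marker */
        to[1] = (unsigned char) (n & 0xFF); /* value, little-endian */
        to[2] = (unsigned char) (n >> 8);
        return 3;
    }
    return 0;                               /* longer forms not sketched */
}
```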
=== modified file 'sql/protocol.h'
--- sql/protocol.h 2007-12-20 21:11:37 +0000
+++ sql/protocol.h 2009-07-02 10:15:33 +0000
@@ -42,6 +42,8 @@
MYSQL_FIELD *next_mysql_field;
MEM_ROOT *alloc;
#endif
+ bool net_store_data(const uchar *from, size_t length,
+ CHARSET_INFO *fromcs, CHARSET_INFO *tocs);
bool store_string_aux(const char *from, size_t length,
CHARSET_INFO *fromcs, CHARSET_INFO *tocs);
public:
=== modified file 'sql/sql_string.cc'
--- sql/sql_string.cc 2009-04-25 10:05:32 +0000
+++ sql/sql_string.cc 2009-07-02 10:15:33 +0000
@@ -782,10 +782,11 @@
*/
-uint32
-copy_and_convert(char *to, uint32 to_length, CHARSET_INFO *to_cs,
- const char *from, uint32 from_length, CHARSET_INFO *from_cs,
- uint *errors)
+static uint32
+copy_and_convert_extended(char *to, uint32 to_length, CHARSET_INFO *to_cs,
+ const char *from, uint32 from_length,
+ CHARSET_INFO *from_cs,
+ uint *errors)
{
int cnvres;
my_wc_t wc;
@@ -900,6 +901,65 @@
}
/*
+ Optimized for quick copying of ASCII characters in the range 0x00..0x7F.
+*/
+uint32
+copy_and_convert(char *to, uint32 to_length, CHARSET_INFO *to_cs,
+ const char *from, uint32 from_length, CHARSET_INFO *from_cs,
+ uint *errors)
+{
+ /*
+ If either character set is not ASCII-compatible,
+ immediately switch to the slow mb_wc->wc_mb method.
+ */
+ if ((to_cs->state | from_cs->state) & MY_CS_NONASCII)
+ return copy_and_convert_extended(to, to_length, to_cs,
+ from, from_length, from_cs, errors);
+
+ uint32 length= min(to_length, from_length), length2= length;
+
+#if defined(__i386__)
+ /*
+ Special loop for i386: it allows referring to a
+ non-aligned memory block as uint32, which makes
+ it possible to copy four bytes at once. This
+ gives about a 10% performance improvement compared
+ to the byte-by-byte loop.
+ */
+ for ( ; length >= 4; length-= 4, from+= 4, to+= 4)
+ {
+ if ((*(uint32*)from) & 0x80808080)
+ break;
+ *((uint32*) to)= *((const uint32*) from);
+ }
+#endif
+
+ for (; ; *to++= *from++, length--)
+ {
+ if (!length)
+ {
+ *errors= 0;
+ return length2;
+ }
+ if (*((unsigned char*) from) > 0x7F) /* A non-ASCII character */
+ {
+ uint32 copied_length= length2 - length;
+ to_length-= copied_length;
+ from_length-= copied_length;
+ return copied_length + copy_and_convert_extended(to, to_length,
+ to_cs,
+ from, from_length,
+ from_cs,
+ errors);
+ }
+ }
+
+ DBUG_ASSERT(FALSE); // Should never get to here
+ return 0; // Make compiler happy
+}
+
+
+/*
copy a string,
with optional character set conversion,
with optional left padding (for binary -> UCS2 conversion)
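The i386 fast path above tests four bytes per iteration: a 32-bit word ANDed with 0x80808080 is non-zero exactly when some byte in it has the top bit set, i.e. is non-ASCII. A portable sketch of the same scan, using memcpy for the unaligned load instead of the raw pointer cast (the helper name is ours, not from the patch):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Return the number of leading bytes of `s` that are pure ASCII,
   checking four bytes at a time the way the i386 loop in
   copy_and_convert() does. */
size_t ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i + 4 <= len) {
        uint32_t w;
        memcpy(&w, s + i, 4);           /* portable unaligned load */
        if (w & 0x80808080u)
            break;                      /* a non-ASCII byte is inside */
        i += 4;
    }
    for (; i < len && s[i] < 0x80; i++) /* finish byte by byte */
        ;
    return i;
}
```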
=== modified file 'strings/conf_to_src.c'
--- strings/conf_to_src.c 2008-11-14 16:29:38 +0000
+++ strings/conf_to_src.c 2009-07-02 10:15:33 +0000
@@ -184,11 +184,12 @@
{
fprintf(f,"{\n");
fprintf(f," %d,%d,%d,\n",cs->number,0,0);
- fprintf(f," MY_CS_COMPILED%s%s%s%s,\n",
+ fprintf(f," MY_CS_COMPILED%s%s%s%s%s,\n",
cs->state & MY_CS_BINSORT ? "|MY_CS_BINSORT" : "",
cs->state & MY_CS_PRIMARY ? "|MY_CS_PRIMARY" : "",
is_case_sensitive(cs) ? "|MY_CS_CSSORT" : "",
- my_charset_is_8bit_pure_ascii(cs) ? "|MY_CS_PUREASCII" : "");
+ my_charset_is_8bit_pure_ascii(cs) ? "|MY_CS_PUREASCII" : "",
+ !my_charset_is_ascii_compatible(cs) ? "|MY_CS_NONASCII": "");
if (cs->name)
{
=== modified file 'strings/ctype-extra.c'
--- strings/ctype-extra.c 2007-08-20 11:47:31 +0000
+++ strings/ctype-extra.c 2009-07-02 10:15:33 +0000
@@ -6804,7 +6804,7 @@
#ifdef HAVE_CHARSET_swe7
{
10,0,0,
- MY_CS_COMPILED|MY_CS_PRIMARY,
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_NONASCII,
"swe7", /* cset name */
"swe7_swedish_ci", /* coll name */
"", /* comment */
@@ -8454,7 +8454,7 @@
#ifdef HAVE_CHARSET_swe7
{
82,0,0,
- MY_CS_COMPILED|MY_CS_BINSORT,
+ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_NONASCII,
"swe7", /* cset name */
"swe7_bin", /* coll name */
"", /* comment */
@@ -8550,72 +8550,6 @@
}
,
#endif
-#ifdef HAVE_CHARSET_geostd8
-{
- 92,0,0,
- MY_CS_COMPILED|MY_CS_PRIMARY,
- "geostd8", /* cset name */
- "geostd8_general_ci", /* coll name */
- "", /* comment */
- NULL, /* tailoring */
- ctype_geostd8_general_ci, /* ctype */
- to_lower_geostd8_general_ci, /* lower */
- to_upper_geostd8_general_ci, /* upper */
- sort_order_geostd8_general_ci, /* sort_order */
- NULL, /* contractions */
- NULL, /* sort_order_big*/
- to_uni_geostd8_general_ci, /* to_uni */
- NULL, /* from_uni */
- my_unicase_default, /* caseinfo */
- NULL, /* state map */
- NULL, /* ident map */
- 1, /* strxfrm_multiply*/
- 1, /* caseup_multiply*/
- 1, /* casedn_multiply*/
- 1, /* mbminlen */
- 1, /* mbmaxlen */
- 0, /* min_sort_char */
- 255, /* max_sort_char */
- ' ', /* pad_char */
- 0, /* escape_with_backslash_is_dangerous */
- &my_charset_8bit_handler,
- &my_collation_8bit_simple_ci_handler,
-}
-,
-#endif
-#ifdef HAVE_CHARSET_geostd8
-{
- 93,0,0,
- MY_CS_COMPILED|MY_CS_BINSORT,
- "geostd8", /* cset name */
- "geostd8_bin", /* coll name */
- "", /* comment */
- NULL, /* tailoring */
- ctype_geostd8_bin, /* ctype */
- to_lower_geostd8_bin, /* lower */
- to_upper_geostd8_bin, /* upper */
- NULL, /* sort_order */
- NULL, /* contractions */
- NULL, /* sort_order_big*/
- to_uni_geostd8_bin, /* to_uni */
- NULL, /* from_uni */
- my_unicase_default, /* caseinfo */
- NULL, /* state map */
- NULL, /* ident map */
- 1, /* strxfrm_multiply*/
- 1, /* caseup_multiply*/
- 1, /* casedn_multiply*/
- 1, /* mbminlen */
- 1, /* mbmaxlen */
- 0, /* min_sort_char */
- 255, /* max_sort_char */
- ' ', /* pad_char */
- 0, /* escape_with_backslash_is_dangerous */
- &my_charset_8bit_handler,
- &my_collation_8bit_bin_handler,
-}
-,
-#endif
#ifdef HAVE_CHARSET_latin1
{
94,0,0,
=== modified file 'strings/ctype-sjis.c'
--- strings/ctype-sjis.c 2007-10-04 07:10:15 +0000
+++ strings/ctype-sjis.c 2009-07-02 10:15:33 +0000
@@ -4672,7 +4672,7 @@
CHARSET_INFO my_charset_sjis_japanese_ci=
{
13,0,0, /* number */
- MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM, /* state */
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_NONASCII, /* state */
"sjis", /* cs name */
"sjis_japanese_ci", /* name */
"", /* comment */
@@ -4704,7 +4704,7 @@
CHARSET_INFO my_charset_sjis_bin=
{
88,0,0, /* number */
- MY_CS_COMPILED|MY_CS_BINSORT, /* state */
+ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_NONASCII, /* state */
"sjis", /* cs name */
"sjis_bin", /* name */
"", /* comment */
=== modified file 'strings/ctype-uca.c'
--- strings/ctype-uca.c 2007-07-03 09:06:57 +0000
+++ strings/ctype-uca.c 2009-07-02 10:15:33 +0000
@@ -8086,7 +8086,7 @@
CHARSET_INFO my_charset_ucs2_unicode_ci=
{
128,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_unicode_ci", /* name */
"", /* comment */
@@ -8118,7 +8118,7 @@
CHARSET_INFO my_charset_ucs2_icelandic_uca_ci=
{
129,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_icelandic_ci",/* name */
"", /* comment */
@@ -8150,7 +8150,7 @@
CHARSET_INFO my_charset_ucs2_latvian_uca_ci=
{
130,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_latvian_ci", /* name */
"", /* comment */
@@ -8182,7 +8182,7 @@
CHARSET_INFO my_charset_ucs2_romanian_uca_ci=
{
131,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_romanian_ci", /* name */
"", /* comment */
@@ -8214,7 +8214,7 @@
CHARSET_INFO my_charset_ucs2_slovenian_uca_ci=
{
132,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_slovenian_ci",/* name */
"", /* comment */
@@ -8246,7 +8246,7 @@
CHARSET_INFO my_charset_ucs2_polish_uca_ci=
{
133,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_polish_ci", /* name */
"", /* comment */
@@ -8278,7 +8278,7 @@
CHARSET_INFO my_charset_ucs2_estonian_uca_ci=
{
134,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_estonian_ci", /* name */
"", /* comment */
@@ -8310,7 +8310,7 @@
CHARSET_INFO my_charset_ucs2_spanish_uca_ci=
{
135,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_spanish_ci", /* name */
"", /* comment */
@@ -8342,7 +8342,7 @@
CHARSET_INFO my_charset_ucs2_swedish_uca_ci=
{
136,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_swedish_ci", /* name */
"", /* comment */
@@ -8374,7 +8374,7 @@
CHARSET_INFO my_charset_ucs2_turkish_uca_ci=
{
137,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_turkish_ci", /* name */
"", /* comment */
@@ -8406,7 +8406,7 @@
CHARSET_INFO my_charset_ucs2_czech_uca_ci=
{
138,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_czech_ci", /* name */
"", /* comment */
@@ -8439,7 +8439,7 @@
CHARSET_INFO my_charset_ucs2_danish_uca_ci=
{
139,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_danish_ci", /* name */
"", /* comment */
@@ -8471,7 +8471,7 @@
CHARSET_INFO my_charset_ucs2_lithuanian_uca_ci=
{
140,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_lithuanian_ci",/* name */
"", /* comment */
@@ -8503,7 +8503,7 @@
CHARSET_INFO my_charset_ucs2_slovak_uca_ci=
{
141,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_slovak_ci", /* name */
"", /* comment */
@@ -8535,7 +8535,7 @@
CHARSET_INFO my_charset_ucs2_spanish2_uca_ci=
{
142,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_spanish2_ci", /* name */
"", /* comment */
@@ -8568,7 +8568,7 @@
CHARSET_INFO my_charset_ucs2_roman_uca_ci=
{
143,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_roman_ci", /* name */
"", /* comment */
@@ -8601,7 +8601,7 @@
CHARSET_INFO my_charset_ucs2_persian_uca_ci=
{
144,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_persian_ci", /* name */
"", /* comment */
@@ -8634,7 +8634,7 @@
CHARSET_INFO my_charset_ucs2_esperanto_uca_ci=
{
145,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_esperanto_ci",/* name */
"", /* comment */
@@ -8667,7 +8667,7 @@
CHARSET_INFO my_charset_ucs2_hungarian_uca_ci=
{
146,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_hungarian_ci",/* name */
"", /* comment */
=== modified file 'strings/ctype-ucs2.c'
--- strings/ctype-ucs2.c 2009-02-13 16:41:47 +0000
+++ strings/ctype-ucs2.c 2009-07-02 10:15:33 +0000
@@ -1717,7 +1717,7 @@
CHARSET_INFO my_charset_ucs2_general_ci=
{
35,0,0, /* number */
- MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_general_ci", /* name */
"", /* comment */
@@ -1749,7 +1749,7 @@
CHARSET_INFO my_charset_ucs2_bin=
{
90,0,0, /* number */
- MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_bin", /* name */
"", /* comment */
=== modified file 'strings/ctype-utf8.c'
--- strings/ctype-utf8.c 2008-02-11 12:28:33 +0000
+++ strings/ctype-utf8.c 2009-07-02 10:15:33 +0000
@@ -4204,7 +4204,7 @@
CHARSET_INFO my_charset_filename=
{
17,0,0, /* number */
- MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_HIDDEN,
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_HIDDEN|MY_CS_NONASCII,
"filename", /* cs name */
"filename", /* name */
"", /* comment */
=== modified file 'strings/ctype.c'
--- strings/ctype.c 2009-04-25 10:05:32 +0000
+++ strings/ctype.c 2009-07-02 10:15:33 +0000
@@ -405,3 +405,23 @@
}
return 1;
}
+
+
+/*
+ Shared function between conf_to_src and mysys.
+ Check if an 8-bit character set is compatible with
+ ASCII in the range 0x00..0x7F.
+*/
+my_bool
+my_charset_is_ascii_compatible(CHARSET_INFO *cs)
+{
+ uint i;
+ if (!cs->tab_to_uni)
+ return 1;
+ for (i= 0; i < 128; i++)
+ {
+ if (cs->tab_to_uni[i] != i)
+ return 0;
+ }
+ return 1;
+}
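my_charset_is_ascii_compatible() above simply compares the charset's to-Unicode table with the identity mapping over 0x00..0x7F. The same check against hypothetical stand-in tables (the real function reads cs->tab_to_uni from a CHARSET_INFO):

```c
#include <stdint.h>

/* A charset is ASCII-compatible when its to-Unicode table maps
   every code point in 0x00..0x7F to itself. A NULL table is
   treated as compatible, as in the original function. */
int is_ascii_compatible(const uint16_t *tab_to_uni)
{
    unsigned i;
    if (!tab_to_uni)
        return 1;
    for (i = 0; i < 128; i++)
        if (tab_to_uni[i] != i)
            return 0;
    return 1;
}
```

This is why swe7 is marked MY_CS_NONASCII in the patch: it remaps some positions below 0x80 to Swedish letters, so its table is not the identity there.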
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1
Your team Maria developers is subscribed to branch lp:maria.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1/+edit-subscription.
[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (monty:2715)
by Michael Widenius 02 Jul '09
#At lp:maria based on revid:monty@askmonty.org-20090630120129-6gan4k9dyjxj83e4
2715 Michael Widenius 2009-07-02
Added MY_CS_NONASCII marker for character sets that are not compatible with latin1 for characters 0x00-0x7f.
This allows us to skip, and thus speed up, some very common character set conversions that MySQL performs when sending data to the client,
which gives a nice speed increase for most queries that use only characters in the range 0x00-0x7f.
This code is based on Alexander Barkov's work in MySQL 6.0.
modified:
include/m_ctype.h
libmysqld/lib_sql.cc
mysys/charset.c
scripts/mysql_install_db.sh
sql/protocol.cc
sql/protocol.h
sql/sql_string.cc
strings/conf_to_src.c
strings/ctype-extra.c
strings/ctype-sjis.c
strings/ctype-uca.c
strings/ctype-ucs2.c
strings/ctype-utf8.c
strings/ctype.c
per-file messages:
include/m_ctype.h
Added MY_CS_NONASCII marker
libmysqld/lib_sql.cc
Added function net_store_data(...) that takes to and from CHARSET_INFO * as arguments
mysys/charset.c
Mark character sets with MY_CS_NONASCII
scripts/mysql_install_db.sh
Fixed messages to refer to MariaDB instead of MySQL
sql/protocol.cc
Added function net_store_data(...) that takes to and from CHARSET_INFO * as arguments
sql/protocol.h
Added function net_store_data(...) that takes to and from CHARSET_INFO * as arguments
sql/sql_string.cc
Quicker copy of strings with no characters above 0x7f
strings/conf_to_src.c
Added printing of MY_CS_NONASCII
strings/ctype-extra.c
Mark incompatible character sets with MY_CS_NONASCII
Removed duplicated character set geostd8
strings/ctype-sjis.c
Mark incompatible character sets with MY_CS_NONASCII
strings/ctype-uca.c
Mark incompatible character sets with MY_CS_NONASCII
strings/ctype-ucs2.c
Mark incompatible character sets with MY_CS_NONASCII
strings/ctype-utf8.c
Mark incompatible character sets with MY_CS_NONASCII
strings/ctype.c
Added function to check if character set is compatible with latin1 in ranges 0x00-0x7f
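The core of the optimization described above is a cheap bit test: copy_and_convert() takes the fast ASCII copy path only when neither character set carries the MY_CS_NONASCII state bit. A minimal sketch of that gate (the flag value 8192 is taken from the m_ctype.h hunk; the helper name is ours):

```c
/* Conversion between two character sets may copy bytes in
   0x00..0x7F verbatim only if neither set is marked as
   ASCII-incompatible in its state flags. */
#define MY_CS_NONASCII 8192

int can_take_ascii_fast_path(unsigned from_state, unsigned to_state)
{
    return !((from_state | to_state) & MY_CS_NONASCII);
}
```

With e.g. latin1 and utf8 (neither flagged) the fast path applies; any ucs2 or sjis endpoint (flagged) forces the slow mb_wc->wc_mb route.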
=== modified file 'include/m_ctype.h'
--- a/include/m_ctype.h 2008-12-23 14:21:01 +0000
+++ b/include/m_ctype.h 2009-07-02 10:15:33 +0000
@@ -87,6 +87,7 @@ extern MY_UNI_CTYPE my_uni_ctype[256];
#define MY_CS_CSSORT 1024 /* if case sensitive sort order */
#define MY_CS_HIDDEN 2048 /* don't display in SHOW */
#define MY_CS_PUREASCII 4096 /* if a charset is pure ascii */
+#define MY_CS_NONASCII 8192 /* if not ASCII-compatible */
#define MY_CHARSET_UNDEFINED 0
/* Character repertoire flags */
@@ -517,6 +518,7 @@ uint my_charset_repertoire(CHARSET_INFO
#define my_strcasecmp(s, a, b) ((s)->coll->strcasecmp((s), (a), (b)))
#define my_charpos(cs, b, e, num) (cs)->cset->charpos((cs), (const char*) (b), (const char *)(e), (num))
+my_bool my_charset_is_ascii_compatible(CHARSET_INFO *cs);
#define use_mb(s) ((s)->cset->ismbchar != NULL)
#define my_ismbchar(s, a, b) ((s)->cset->ismbchar((s), (a), (b)))
+ echo "You can find additional information about the MySQL part at:"
+ echo "http://dev.mysql.com"
+ echo "Support MariaDB development by buying support/new features from"
+ echo "Monty Program Ab. You can contact us about this at sales(a)askmonty.org".
+ echo "Alternatively consider joining our community based development effort:"
+ echo "http://askmonty.org/wiki/index.php/MariaDB#How_can_I_participate_in_the_dev…"
echo
fi
=== modified file 'sql/protocol.cc'
--- a/sql/protocol.cc 2009-04-25 10:05:32 +0000
+++ b/sql/protocol.cc 2009-07-02 10:15:33 +0000
@@ -58,6 +58,65 @@ bool Protocol_binary::net_store_data(con
}
+/*
+ net_store_data() - extended version with character set conversion.
+
+ It is optimized for short strings whose length after
+ conversion is guaranteed to be less than 251, which occupies
+ exactly one byte to store the length. This makes it possible
+ to avoid using the "convert" member as a temporary buffer:
+ the conversion is done directly into the "packet" member.
+ The limit of 251 is good enough to optimize send_fields(),
+ because column, table, and database names fit within this limit.
+*/
+
+#ifndef EMBEDDED_LIBRARY
+bool Protocol::net_store_data(const uchar *from, size_t length,
+ CHARSET_INFO *from_cs, CHARSET_INFO *to_cs)
+{
+ uint dummy_errors;
+ /* Calculate the maximum possible result length */
+ size_t conv_length= to_cs->mbmaxlen * length / from_cs->mbminlen;
+ ulong packet_length, new_length;
+ char *length_pos, *to;
+
+ if (conv_length > 250)
+ {
+ /*
+ For strings with conv_length greater than 250 bytes
+ we don't know how many bytes we will need to store the length
+ (one or two), because we don't know the result length until
+ the conversion is done. For example, when converting from utf8
+ (mbmaxlen=3) to latin1, conv_length=300 means that the result
+ length can vary between 100 and 300: length=100 needs one byte,
+ length=300 needs two bytes.
+
+ Thus converting directly into "packet" is not worthwhile.
+ Let's use "convert" as a temporary buffer.
+ */
+ return (convert->copy((const char*) from, length, from_cs, to_cs,
+ &dummy_errors) ||
+ net_store_data((const uchar*) convert->ptr(), convert->length()));
+ }
+
+ packet_length= packet->length();
+ new_length= packet_length + conv_length + 1;
+
+ if (new_length > packet->alloced_length() && packet->realloc(new_length))
+ return 1;
+
+ length_pos= (char*) packet->ptr() + packet_length;
+ to= length_pos + 1;
+
+ to+= copy_and_convert(to, conv_length, to_cs,
+ (const char*) from, length, from_cs, &dummy_errors);
+
+ net_store_length((uchar*) length_pos, to - length_pos - 1);
+ packet->length((uint) (to - packet->ptr()));
+ return 0;
+}
+#endif
+
+
/**
Send a error string to client.
@@ -773,10 +832,10 @@ bool Protocol::store_string_aux(const ch
fromcs != &my_charset_bin &&
tocs != &my_charset_bin)
{
- uint dummy_errors;
- return (convert->copy(from, length, fromcs, tocs, &dummy_errors) ||
- net_store_data((uchar*) convert->ptr(), convert->length()));
+ /* Store with conversion */
+ return net_store_data((uchar*) from, length, fromcs, tocs);
}
+ /* Store without conversion */
return net_store_data((uchar*) from, length);
}
@@ -802,7 +861,7 @@ bool Protocol_text::store(const char *fr
{
CHARSET_INFO *tocs= this->thd->variables.character_set_results;
#ifndef DBUG_OFF
- DBUG_PRINT("info", ("Protocol_text::store field %u (%u): %*s", field_pos,
+ DBUG_PRINT("info", ("Protocol_text::store field %u (%u): %.*s", field_pos,
field_count, (int) length, from));
DBUG_ASSERT(field_pos < field_count);
DBUG_ASSERT(field_types == 0 ||
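The one-byte length prefix that the new Protocol::net_store_data() variant relies on can be sketched as follows. This is an illustrative re-implementation of the length-encoded integer rule (it is not the server's net_store_length()): lengths 0..250 occupy exactly one byte, so a converted string can be written directly after a single reserved length byte, while larger lengths need a marker byte plus a multi-byte integer, which is why the patch falls back to the temporary "convert" buffer when conv_length > 250.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of length-encoded integer storage:
   0..250 fits in one byte; larger lengths need a marker byte
   (0xfc for a 2-byte length, 0xfd for a 3-byte length) followed
   by the length in little-endian order. Returns the number of
   bytes written. */
static size_t store_length(unsigned char *to, unsigned long length)
{
  if (length < 251)
  {
    to[0]= (unsigned char) length;          /* the optimized one-byte case */
    return 1;
  }
  if (length < 65536)
  {
    to[0]= 0xfc;                            /* marker: 2-byte length follows */
    to[1]= (unsigned char) (length & 0xff);
    to[2]= (unsigned char) (length >> 8);
    return 3;
  }
  to[0]= 0xfd;                              /* marker: 3-byte length follows */
  to[1]= (unsigned char) (length & 0xff);
  to[2]= (unsigned char) ((length >> 8) & 0xff);
  to[3]= (unsigned char) ((length >> 16) & 0xff);
  return 4;
}
```

Because a string of at most 250 bytes always takes exactly one length byte, the patched function can reserve that byte up front and convert straight into the packet buffer.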
=== modified file 'sql/protocol.h'
--- a/sql/protocol.h 2007-12-20 21:11:37 +0000
+++ b/sql/protocol.h 2009-07-02 10:15:33 +0000
@@ -42,6 +42,8 @@ protected:
MYSQL_FIELD *next_mysql_field;
MEM_ROOT *alloc;
#endif
+ bool net_store_data(const uchar *from, size_t length,
+ CHARSET_INFO *fromcs, CHARSET_INFO *tocs);
bool store_string_aux(const char *from, size_t length,
CHARSET_INFO *fromcs, CHARSET_INFO *tocs);
public:
=== modified file 'sql/sql_string.cc'
--- a/sql/sql_string.cc 2009-04-25 10:05:32 +0000
+++ b/sql/sql_string.cc 2009-07-02 10:15:33 +0000
@@ -782,10 +782,11 @@ String *copy_if_not_alloced(String *to,S
*/
-uint32
-copy_and_convert(char *to, uint32 to_length, CHARSET_INFO *to_cs,
- const char *from, uint32 from_length, CHARSET_INFO *from_cs,
- uint *errors)
+static uint32
+copy_and_convert_extended(char *to, uint32 to_length, CHARSET_INFO *to_cs,
+ const char *from, uint32 from_length,
+ CHARSET_INFO *from_cs,
+ uint *errors)
{
int cnvres;
my_wc_t wc;
@@ -900,6 +901,65 @@ my_copy_with_hex_escaping(CHARSET_INFO *
}
/*
+ Optimized for quick copying of ASCII characters in the range 0x00..0x7F.
+*/
+uint32
+copy_and_convert(char *to, uint32 to_length, CHARSET_INFO *to_cs,
+ const char *from, uint32 from_length, CHARSET_INFO *from_cs,
+ uint *errors)
+{
+ /*
+ If any of the character sets is not ASCII compatible,
+ immediately switch to slow mb_wc->wc_mb method.
+ */
+ if ((to_cs->state | from_cs->state) & MY_CS_NONASCII)
+ return copy_and_convert_extended(to, to_length, to_cs,
+ from, from_length, from_cs, errors);
+
+ uint32 length= min(to_length, from_length), length2= length;
+
+#if defined(__i386__)
+ /*
+ Special loop for i386: it accesses a non-aligned memory
+ block as uint32, which makes it possible to copy four
+ bytes at once. This gives about a 10% performance
+ improvement compared to the byte-by-byte loop.
+ */
+ for ( ; length >= 4; length-= 4, from+= 4, to+= 4)
+ {
+ if ((*(uint32*)from) & 0x80808080)
+ break;
+ *((uint32*) to)= *((const uint32*) from);
+ }
+#endif
+
+ for (; ; *to++= *from++, length--)
+ {
+ if (!length)
+ {
+ *errors= 0;
+ return length2;
+ }
+ if (*((unsigned char*) from) > 0x7F) /* A non-ASCII character */
+ {
+ uint32 copied_length= length2 - length;
+ to_length-= copied_length;
+ from_length-= copied_length;
+ return copied_length + copy_and_convert_extended(to, to_length,
+ to_cs,
+ from, from_length,
+ from_cs,
+ errors);
+ }
+ }
+
+ DBUG_ASSERT(FALSE); // Should never get to here
+ return 0; // Make compiler happy
+}
+
+
+/*
copy a string,
with optional character set conversion,
with optional left padding (for binary -> UCS2 conversion)
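The idea behind the patched copy_and_convert() can be sketched in isolation. In this hypothetical helper (not a server function), bytes are copied verbatim as long as both character sets are ASCII-compatible and the byte is in 0x00..0x7F; the first byte above 0x7F stops the fast path, and the remainder would be handed to the slow per-character mb_wc->wc_mb converter.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the leading pure-ASCII prefix of "from" into "to" without any
   character set conversion, and return the number of bytes copied.
   A caller would pass the uncopied tail to a real conversion routine. */
static size_t copy_ascii_prefix(char *to, const char *from, size_t length)
{
  size_t i;
  for (i= 0; i < length; i++)
  {
    if ((unsigned char) from[i] > 0x7F)
      break;                       /* non-ASCII byte: needs real conversion */
    to[i]= from[i];
  }
  return i;                        /* number of bytes copied as-is */
}
```

The server version additionally adds the i386-specific loop that tests four bytes at a time against the mask 0x80808080 before falling back to this byte-by-byte scan.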
=== modified file 'strings/conf_to_src.c'
--- a/strings/conf_to_src.c 2008-11-14 16:29:38 +0000
+++ b/strings/conf_to_src.c 2009-07-02 10:15:33 +0000
@@ -184,11 +184,12 @@ void dispcset(FILE *f,CHARSET_INFO *cs)
{
fprintf(f,"{\n");
fprintf(f," %d,%d,%d,\n",cs->number,0,0);
- fprintf(f," MY_CS_COMPILED%s%s%s%s,\n",
+ fprintf(f," MY_CS_COMPILED%s%s%s%s%s,\n",
cs->state & MY_CS_BINSORT ? "|MY_CS_BINSORT" : "",
cs->state & MY_CS_PRIMARY ? "|MY_CS_PRIMARY" : "",
is_case_sensitive(cs) ? "|MY_CS_CSSORT" : "",
- my_charset_is_8bit_pure_ascii(cs) ? "|MY_CS_PUREASCII" : "");
+ my_charset_is_8bit_pure_ascii(cs) ? "|MY_CS_PUREASCII" : "",
+ !my_charset_is_ascii_compatible(cs) ? "|MY_CS_NONASCII": "");
if (cs->name)
{
=== modified file 'strings/ctype-extra.c'
--- a/strings/ctype-extra.c 2007-08-20 11:47:31 +0000
+++ b/strings/ctype-extra.c 2009-07-02 10:15:33 +0000
@@ -6804,7 +6804,7 @@ CHARSET_INFO compiled_charsets[] = {
#ifdef HAVE_CHARSET_swe7
{
10,0,0,
- MY_CS_COMPILED|MY_CS_PRIMARY,
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_NONASCII,
"swe7", /* cset name */
"swe7_swedish_ci", /* coll name */
"", /* comment */
@@ -8454,7 +8454,7 @@ CHARSET_INFO compiled_charsets[] = {
#ifdef HAVE_CHARSET_swe7
{
82,0,0,
- MY_CS_COMPILED|MY_CS_BINSORT,
+ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_NONASCII,
"swe7", /* cset name */
"swe7_bin", /* coll name */
"", /* comment */
@@ -8550,72 +8550,6 @@ CHARSET_INFO compiled_charsets[] = {
}
,
#endif
-#ifdef HAVE_CHARSET_geostd8
-{
- 92,0,0,
- MY_CS_COMPILED|MY_CS_PRIMARY,
- "geostd8", /* cset name */
- "geostd8_general_ci", /* coll name */
- "", /* comment */
- NULL, /* tailoring */
- ctype_geostd8_general_ci, /* ctype */
- to_lower_geostd8_general_ci, /* lower */
- to_upper_geostd8_general_ci, /* upper */
- sort_order_geostd8_general_ci, /* sort_order */
- NULL, /* contractions */
- NULL, /* sort_order_big*/
- to_uni_geostd8_general_ci, /* to_uni */
- NULL, /* from_uni */
- my_unicase_default, /* caseinfo */
- NULL, /* state map */
- NULL, /* ident map */
- 1, /* strxfrm_multiply*/
- 1, /* caseup_multiply*/
- 1, /* casedn_multiply*/
- 1, /* mbminlen */
- 1, /* mbmaxlen */
- 0, /* min_sort_char */
- 255, /* max_sort_char */
- ' ', /* pad_char */
- 0, /* escape_with_backslash_is_dangerous */
- &my_charset_8bit_handler,
- &my_collation_8bit_simple_ci_handler,
-}
-,
-#endif
-#ifdef HAVE_CHARSET_geostd8
-{
- 93,0,0,
- MY_CS_COMPILED|MY_CS_BINSORT,
- "geostd8", /* cset name */
- "geostd8_bin", /* coll name */
- "", /* comment */
- NULL, /* tailoring */
- ctype_geostd8_bin, /* ctype */
- to_lower_geostd8_bin, /* lower */
- to_upper_geostd8_bin, /* upper */
- NULL, /* sort_order */
- NULL, /* contractions */
- NULL, /* sort_order_big*/
- to_uni_geostd8_bin, /* to_uni */
- NULL, /* from_uni */
- my_unicase_default, /* caseinfo */
- NULL, /* state map */
- NULL, /* ident map */
- 1, /* strxfrm_multiply*/
- 1, /* caseup_multiply*/
- 1, /* casedn_multiply*/
- 1, /* mbminlen */
- 1, /* mbmaxlen */
- 0, /* min_sort_char */
- 255, /* max_sort_char */
- ' ', /* pad_char */
- 0, /* escape_with_backslash_is_dangerous */
- &my_charset_8bit_handler,
- &my_collation_8bit_bin_handler,
-}
-,
-#endif
#ifdef HAVE_CHARSET_latin1
{
94,0,0,
=== modified file 'strings/ctype-sjis.c'
--- a/strings/ctype-sjis.c 2007-10-04 07:10:15 +0000
+++ b/strings/ctype-sjis.c 2009-07-02 10:15:33 +0000
@@ -4672,7 +4672,7 @@ static MY_CHARSET_HANDLER my_charset_han
CHARSET_INFO my_charset_sjis_japanese_ci=
{
13,0,0, /* number */
- MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM, /* state */
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_NONASCII, /* state */
"sjis", /* cs name */
"sjis_japanese_ci", /* name */
"", /* comment */
@@ -4704,7 +4704,7 @@ CHARSET_INFO my_charset_sjis_japanese_ci
CHARSET_INFO my_charset_sjis_bin=
{
88,0,0, /* number */
- MY_CS_COMPILED|MY_CS_BINSORT, /* state */
+ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_NONASCII, /* state */
"sjis", /* cs name */
"sjis_bin", /* name */
"", /* comment */
=== modified file 'strings/ctype-uca.c'
--- a/strings/ctype-uca.c 2007-07-03 09:06:57 +0000
+++ b/strings/ctype-uca.c 2009-07-02 10:15:33 +0000
@@ -8086,7 +8086,7 @@ MY_COLLATION_HANDLER my_collation_ucs2_u
CHARSET_INFO my_charset_ucs2_unicode_ci=
{
128,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_unicode_ci", /* name */
"", /* comment */
@@ -8118,7 +8118,7 @@ CHARSET_INFO my_charset_ucs2_unicode_ci=
CHARSET_INFO my_charset_ucs2_icelandic_uca_ci=
{
129,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_icelandic_ci",/* name */
"", /* comment */
@@ -8150,7 +8150,7 @@ CHARSET_INFO my_charset_ucs2_icelandic_u
CHARSET_INFO my_charset_ucs2_latvian_uca_ci=
{
130,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_latvian_ci", /* name */
"", /* comment */
@@ -8182,7 +8182,7 @@ CHARSET_INFO my_charset_ucs2_latvian_uca
CHARSET_INFO my_charset_ucs2_romanian_uca_ci=
{
131,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_romanian_ci", /* name */
"", /* comment */
@@ -8214,7 +8214,7 @@ CHARSET_INFO my_charset_ucs2_romanian_uc
CHARSET_INFO my_charset_ucs2_slovenian_uca_ci=
{
132,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_slovenian_ci",/* name */
"", /* comment */
@@ -8246,7 +8246,7 @@ CHARSET_INFO my_charset_ucs2_slovenian_u
CHARSET_INFO my_charset_ucs2_polish_uca_ci=
{
133,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_polish_ci", /* name */
"", /* comment */
@@ -8278,7 +8278,7 @@ CHARSET_INFO my_charset_ucs2_polish_uca_
CHARSET_INFO my_charset_ucs2_estonian_uca_ci=
{
134,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_estonian_ci", /* name */
"", /* comment */
@@ -8310,7 +8310,7 @@ CHARSET_INFO my_charset_ucs2_estonian_uc
CHARSET_INFO my_charset_ucs2_spanish_uca_ci=
{
135,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_spanish_ci", /* name */
"", /* comment */
@@ -8342,7 +8342,7 @@ CHARSET_INFO my_charset_ucs2_spanish_uca
CHARSET_INFO my_charset_ucs2_swedish_uca_ci=
{
136,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_swedish_ci", /* name */
"", /* comment */
@@ -8374,7 +8374,7 @@ CHARSET_INFO my_charset_ucs2_swedish_uca
CHARSET_INFO my_charset_ucs2_turkish_uca_ci=
{
137,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_turkish_ci", /* name */
"", /* comment */
@@ -8406,7 +8406,7 @@ CHARSET_INFO my_charset_ucs2_turkish_uca
CHARSET_INFO my_charset_ucs2_czech_uca_ci=
{
138,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_czech_ci", /* name */
"", /* comment */
@@ -8439,7 +8439,7 @@ CHARSET_INFO my_charset_ucs2_czech_uca_c
CHARSET_INFO my_charset_ucs2_danish_uca_ci=
{
139,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_danish_ci", /* name */
"", /* comment */
@@ -8471,7 +8471,7 @@ CHARSET_INFO my_charset_ucs2_danish_uca_
CHARSET_INFO my_charset_ucs2_lithuanian_uca_ci=
{
140,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_lithuanian_ci",/* name */
"", /* comment */
@@ -8503,7 +8503,7 @@ CHARSET_INFO my_charset_ucs2_lithuanian_
CHARSET_INFO my_charset_ucs2_slovak_uca_ci=
{
141,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_slovak_ci", /* name */
"", /* comment */
@@ -8535,7 +8535,7 @@ CHARSET_INFO my_charset_ucs2_slovak_uca_
CHARSET_INFO my_charset_ucs2_spanish2_uca_ci=
{
142,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_spanish2_ci", /* name */
"", /* comment */
@@ -8568,7 +8568,7 @@ CHARSET_INFO my_charset_ucs2_spanish2_uc
CHARSET_INFO my_charset_ucs2_roman_uca_ci=
{
143,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_roman_ci", /* name */
"", /* comment */
@@ -8601,7 +8601,7 @@ CHARSET_INFO my_charset_ucs2_roman_uca_c
CHARSET_INFO my_charset_ucs2_persian_uca_ci=
{
144,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_persian_ci", /* name */
"", /* comment */
@@ -8634,7 +8634,7 @@ CHARSET_INFO my_charset_ucs2_persian_uca
CHARSET_INFO my_charset_ucs2_esperanto_uca_ci=
{
145,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_esperanto_ci",/* name */
"", /* comment */
@@ -8667,7 +8667,7 @@ CHARSET_INFO my_charset_ucs2_esperanto_u
CHARSET_INFO my_charset_ucs2_hungarian_uca_ci=
{
146,0,0, /* number */
- MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_hungarian_ci",/* name */
"", /* comment */
=== modified file 'strings/ctype-ucs2.c'
--- a/strings/ctype-ucs2.c 2009-02-13 16:41:47 +0000
+++ b/strings/ctype-ucs2.c 2009-07-02 10:15:33 +0000
@@ -1717,7 +1717,7 @@ MY_CHARSET_HANDLER my_charset_ucs2_handl
CHARSET_INFO my_charset_ucs2_general_ci=
{
35,0,0, /* number */
- MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_general_ci", /* name */
"", /* comment */
@@ -1749,7 +1749,7 @@ CHARSET_INFO my_charset_ucs2_general_ci=
CHARSET_INFO my_charset_ucs2_bin=
{
90,0,0, /* number */
- MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_UNICODE,
+ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_UNICODE|MY_CS_NONASCII,
"ucs2", /* cs name */
"ucs2_bin", /* name */
"", /* comment */
=== modified file 'strings/ctype-utf8.c'
--- a/strings/ctype-utf8.c 2008-02-11 12:28:33 +0000
+++ b/strings/ctype-utf8.c 2009-07-02 10:15:33 +0000
@@ -4204,7 +4204,7 @@ static MY_CHARSET_HANDLER my_charset_fil
CHARSET_INFO my_charset_filename=
{
17,0,0, /* number */
- MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_HIDDEN,
+ MY_CS_COMPILED|MY_CS_PRIMARY|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_HIDDEN|MY_CS_NONASCII,
"filename", /* cs name */
"filename", /* name */
"", /* comment */
=== modified file 'strings/ctype.c'
--- a/strings/ctype.c 2009-04-25 10:05:32 +0000
+++ b/strings/ctype.c 2009-07-02 10:15:33 +0000
@@ -405,3 +405,23 @@ my_charset_is_8bit_pure_ascii(CHARSET_IN
}
return 1;
}
+
+
+/*
+ Shared function between conf_to_src and mysys.
+ Check if an 8-bit character set is compatible with
+ ASCII in the range 0x00..0x7F.
+*/
+my_bool
+my_charset_is_ascii_compatible(CHARSET_INFO *cs)
+{
+ uint i;
+ if (!cs->tab_to_uni)
+ return 1;
+ for (i= 0; i < 128; i++)
+ {
+ if (cs->tab_to_uni[i] != i)
+ return 0;
+ }
+ return 1;
+}
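The check added in ctype.c can be sketched as a standalone function: an 8-bit character set counts as ASCII-compatible when its to-Unicode table maps every code point in 0x00..0x7F to itself. The table used in the sketch below is hypothetical sample data, not a real charset's tab_to_uni.

```c
#include <assert.h>

/* Return 1 if the 8-bit to-Unicode mapping table is ASCII compatible
   (identity on 0x00..0x7F), 0 otherwise. A NULL table is treated as
   compatible, mirroring the patched my_charset_is_ascii_compatible(). */
static int is_ascii_compatible(const unsigned short *tab_to_uni)
{
  unsigned int i;
  if (!tab_to_uni)
    return 1;                      /* no mapping table: assume compatible */
  for (i= 0; i < 128; i++)
    if (tab_to_uni[i] != i)
      return 0;                    /* an ASCII position maps elsewhere */
  return 1;
}
```

Charsets such as swe7 remap some ASCII positions to accented letters, which is why they receive the new MY_CS_NONASCII flag and never take the fast copy path.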
[Maria-developers] Updated (by Guest): dynamic versionning of query plan for performance metric and downgrade . (33)
by worklog-noreply@askmonty.org 30 Jun '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: dynamic versionning of query plan for performance metric and downgrade
.
CREATION DATE..: Tue, 30 Jun 2009, 21:37
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 33 (http://askmonty.org/worklog/?tid=33)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 30
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Wed, 01 Jul 2009, 00:27)=-=-
High Level Description modified.
--- /tmp/wklog.33.old.20217 2009-07-01 00:27:24.000000000 +0300
+++ /tmp/wklog.33.new.20217 2009-07-01 00:27:24.000000000 +0300
@@ -1,6 +1,6 @@
Just for comparing apple and oranges ,
-A lot of internal SUN/ORACLE benchmarks are reporting performance improvements.
+A lot of SUN/ORACLE benchmarks are reporting performance improvements.
But they are only tested on specific workload and predefined scenario like DBT2.
MariaDb could provide dynamic variable QP_vesion = 41|50|51 ...
-=-=(Fromdual - Tue, 30 Jun 2009, 21:53)=-=-
High Level Description modified.
--- /tmp/wklog.33.old.14359 2009-06-30 21:53:31.000000000 +0300
+++ /tmp/wklog.33.new.14359 2009-06-30 21:53:31.000000000 +0300
@@ -3,11 +3,13 @@
A lot of internal SUN/ORACLE benchmarks are reporting performance improvements.
But they are only tested on specific workload and predefined scenario like DBT2.
-MariaDb should provide dynamique QP and provide a ratio of efficiency in regard
-with the number of handler operations, each user would so on, be able to found
-out, if an improvement or a bug in the data acess path match is wokload . With
-such feature 5.0 to 5.1 would have found an inconsistant ratio of 2 to 1/1000
-with a serious fluctuation on time depending on closing , reopening , reclosing
-and reopening bugs like
+MariaDb could provide dynamic variable QP_vesion = 41|50|51 ...
+
+providing benchmarks with a ratio of efficiency in regard with the number of
+handler operations, each user would so on, be able to found if an improvement or
+a bug in the data acess path match is wokload .
+With such feature 5.0 to 5.1 migration would have provide an inconsistant metric
+of 2 to 1/1000 with a serious fluctuation on time depending on closing ,
+reopening , reclosing and reopening bugs like
http://bugs.mysql.com/bug.php?id=28404
-=-=(Fromdual - Tue, 30 Jun 2009, 21:44)=-=-
Title modified.
--- /tmp/wklog.33.old.13936 2009-06-30 21:44:50.000000000 +0300
+++ /tmp/wklog.33.new.13936 2009-06-30 21:44:50.000000000 +0300
@@ -1 +1 @@
-Dynamique versionning of query plan for performance metric and downgrade .
+dynamic versionning of query plan for performance metric and downgrade .
DESCRIPTION:
Just for comparing apples and oranges:
a lot of SUN/ORACLE benchmarks report performance improvements,
but they are only tested on specific workloads and predefined scenarios like DBT2.
MariaDB could provide a dynamic variable QP_version = 41|50|51 ...
providing benchmarks with an efficiency ratio relative to the number of
handler operations; each user would then be able to find out whether an
improvement or a bug in the data access path matches his workload.
With such a feature, the 5.0 to 5.1 migration would have shown an inconsistent
metric of 2 to 1/1000, with serious fluctuation over time depending on closing,
reopening, reclosing and reopening bugs like
http://bugs.mysql.com/bug.php?id=28404
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Fromdual): dynamic versionning of query plan for performance metric and downgrade . (33)
by worklog-noreply@askmonty.org 30 Jun '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: dynamic versionning of query plan for performance metric and downgrade
.
CREATION DATE..: Tue, 30 Jun 2009, 21:37
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 33 (http://askmonty.org/worklog/?tid=33)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 30
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Fromdual - Tue, 30 Jun 2009, 21:53)=-=-
High Level Description modified.
--- /tmp/wklog.33.old.14359 2009-06-30 21:53:31.000000000 +0300
+++ /tmp/wklog.33.new.14359 2009-06-30 21:53:31.000000000 +0300
@@ -3,11 +3,13 @@
A lot of internal SUN/ORACLE benchmarks are reporting performance improvements.
But they are only tested on specific workload and predefined scenario like DBT2.
-MariaDb should provide dynamique QP and provide a ratio of efficiency in regard
-with the number of handler operations, each user would so on, be able to found
-out, if an improvement or a bug in the data acess path match is wokload . With
-such feature 5.0 to 5.1 would have found an inconsistant ratio of 2 to 1/1000
-with a serious fluctuation on time depending on closing , reopening , reclosing
-and reopening bugs like
+MariaDb could provide dynamic variable QP_vesion = 41|50|51 ...
+
+providing benchmarks with a ratio of efficiency in regard with the number of
+handler operations, each user would so on, be able to found if an improvement or
+a bug in the data acess path match is wokload .
+With such feature 5.0 to 5.1 migration would have provide an inconsistant metric
+of 2 to 1/1000 with a serious fluctuation on time depending on closing ,
+reopening , reclosing and reopening bugs like
http://bugs.mysql.com/bug.php?id=28404
-=-=(Fromdual - Tue, 30 Jun 2009, 21:44)=-=-
Title modified.
--- /tmp/wklog.33.old.13936 2009-06-30 21:44:50.000000000 +0300
+++ /tmp/wklog.33.new.13936 2009-06-30 21:44:50.000000000 +0300
@@ -1 +1 @@
-Dynamique versionning of query plan for performance metric and downgrade .
+dynamic versionning of query plan for performance metric and downgrade .
DESCRIPTION:
Just for comparing apples and oranges:
a lot of internal SUN/ORACLE benchmarks report performance improvements,
but they are only tested on specific workloads and predefined scenarios like DBT2.
MariaDB could provide a dynamic variable QP_version = 41|50|51 ...
providing benchmarks with an efficiency ratio relative to the number of
handler operations; each user would then be able to find out whether an
improvement or a bug in the data access path matches his workload.
With such a feature, the 5.0 to 5.1 migration would have shown an inconsistent
metric of 2 to 1/1000, with serious fluctuation over time depending on closing,
reopening, reclosing and reopening bugs like
http://bugs.mysql.com/bug.php?id=28404
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Fromdual): dynamic versionning of query plan for performance metric and downgrade . (33)
by worklog-noreply@askmonty.org 30 Jun '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: dynamic versionning of query plan for performance metric and downgrade
.
CREATION DATE..: Tue, 30 Jun 2009, 21:37
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 33 (http://askmonty.org/worklog/?tid=33)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 30
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Fromdual - Tue, 30 Jun 2009, 21:44)=-=-
Title modified.
--- /tmp/wklog.33.old.13936 2009-06-30 21:44:50.000000000 +0300
+++ /tmp/wklog.33.new.13936 2009-06-30 21:44:50.000000000 +0300
@@ -1 +1 @@
-Dynamique versionning of query plan for performance metric and downgrade .
+dynamic versionning of query plan for performance metric and downgrade .
DESCRIPTION:
Just for comparing apples and oranges:
a lot of internal SUN/ORACLE benchmarks report performance improvements,
but they are only tested on specific workloads and predefined scenarios like DBT2.
MariaDB should provide a dynamic QP and an efficiency ratio relative to the
number of handler operations; each user would then be able to find out whether
an improvement or a bug in the data access path matches his workload. With
such a feature, 5.0 to 5.1 would have shown an inconsistent ratio of 2 to
1/1000, with serious fluctuation over time depending on closing, reopening,
reclosing and reopening bugs like
http://bugs.mysql.com/bug.php?id=28404
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] New (by Fromdual): Dynamique versionning of query plan for performance metric and downgrade . (33)
by worklog-noreply@askmonty.org 30 Jun '09
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Dynamique versionning of query plan for performance metric and
downgrade .
CREATION DATE..: Tue, 30 Jun 2009, 21:37
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 33 (http://askmonty.org/worklog/?tid=33)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 30
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Just for comparing apples and oranges:
a lot of internal SUN/ORACLE benchmarks report performance improvements,
but they are only tested on specific workloads and predefined scenarios like DBT2.
MariaDB should provide a dynamic QP and an efficiency ratio relative to the
number of handler operations; each user would then be able to find out whether
an improvement or a bug in the data access path matches his workload. With
such a feature, 5.0 to 5.1 would have shown an inconsistent ratio of 2 to
1/1000, with serious fluctuation over time depending on closing, reopening,
reclosing and reopening bugs like
http://bugs.mysql.com/bug.php?id=28404
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Rev 2730: TEst commits 3 in file:///home/psergey/dev/maria-5.1-table-elim-emailcommittests/
by Sergey Petrunya 30 Jun '09
At file:///home/psergey/dev/maria-5.1-table-elim-emailcommittests/
------------------------------------------------------------
revno: 2730
revision-id: psergey(a)askmonty.org-20090630181749-29kxcglcbfaiyygp
parent: psergey(a)askmonty.org-20090630180521-32redd6z13g9tluc
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: maria-5.1-table-elim-emailcommittests
timestamp: Tue 2009-06-30 22:17:49 +0400
message:
TEst commits 3
=== modified file 'sql/opt_sum.cc'
--- a/sql/opt_sum.cc 2009-04-25 10:05:32 +0000
+++ b/sql/opt_sum.cc 2009-06-30 18:17:49 +0000
@@ -13,7 +13,7 @@
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
-
+# error Test commits 3
/**
@file
[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (psergey:2708)
by Sergey Petrunia 30 Jun '09
#At lp:maria based on revid:knielsen@knielsen-hq.org-20090602110359-n4q9gof38buucrny
2708 Sergey Petrunia 2009-06-30
MWL#17: Table elimination
- RC0 code
added:
mysql-test/r/table_elim.result
mysql-test/t/table_elim.test
sql-bench/test-table-elimination.sh
sql/opt_table_elimination.cc
modified:
libmysqld/Makefile.am
mysql-test/r/ps_11bugs.result
mysql-test/r/select.result
mysql-test/r/subselect.result
mysql-test/r/union.result
sql/CMakeLists.txt
sql/Makefile.am
sql/item.cc
sql/item.h
sql/item_subselect.cc
sql/item_subselect.h
sql/item_sum.cc
sql/item_sum.h
sql/sql_lex.cc
sql/sql_lex.h
sql/sql_select.cc
sql/sql_select.h
sql/table.h
per-file messages:
libmysqld/Makefile.am
MWL#17: Table elimination
- add opt_table_elimination.cc
mysql-test/r/ps_11bugs.result
MWL#17: Table elimination
- Update test results (the difference is because
we now recognize Item_ref(const_item) as const)
mysql-test/r/select.result
MWL#17: Table elimination
- Update test results
mysql-test/r/subselect.result
MWL#17: Table elimination
- Update test results (the difference is because
we now recognize Item_ref(const_item) as const)
mysql-test/r/table_elim.result
MWL#17: Table elimination
- Testcases
mysql-test/r/union.result
MWL#17: Table elimination
- Update test results (the difference is because
we now recognize Item_ref(const_item) as const)
mysql-test/t/table_elim.test
MWL#17: Table elimination
- Testcases
sql-bench/test-table-elimination.sh
MWL#17: Table elimination
- Benchmark which compares table elimination queries with no-table-elimination queries
sql/CMakeLists.txt
MWL#17: Table elimination
- add opt_table_elimination.cc
sql/Makefile.am
MWL#17: Table elimination
- add opt_table_elimination.cc
sql/item.cc
MWL#17: Table elimination
- Add Item_field::check_column_usage_processor
sql/item.h
MWL#17: Table elimination
- Add check_column_usage_processor()
sql/item_subselect.cc
MWL#17: Table elimination
- Make Item_subselect able to
= tell which particular items are referred to from inside the select
= tell whether it was eliminated
sql/item_subselect.h
MWL#17: Table elimination
- Make Item_subselect able to
= tell which particular items are referred to from inside the select
= tell whether it was eliminated
sql/item_sum.cc
MWL#17: Table elimination
- Fix Item_sum_sum::used_tables() to report tables whose columns it really needs
sql/item_sum.h
MWL#17: Table elimination
- Fix Item_sum_sum::used_tables() to report tables whose columns it really needs
sql/opt_table_elimination.cc
MWL#17: Table elimination
- Table elimination Module
sql/sql_lex.cc
MWL#17: Table elimination
- Collect Item_subselect::refers_to attribute
sql/sql_lex.h
MWL#17: Table elimination
- Collect Item_subselect::refers_to attribute
sql/sql_select.cc
MWL#17: Table elimination
- Make the KEYUSE array code also collect/process "binding" equalities of the form
t.keyXpartY= func(t.keyXpartZ,...)
- Call the table elimination function
- Make EXPLAIN not show eliminated tables/selects
- Added more comments
- Move definitions of FT_KEYPART, KEY_OPTIMIZE_* into sql_select.h as they are now
used in opt_table_elimination.cc
sql/sql_select.h
MWL#17: Table elimination
- Make the KEYUSE array code also collect/process "binding" equalities of the form
t.keyXpartY= func(t.keyXpartZ,...)
- Call the table elimination function
- Make EXPLAIN not show eliminated tables/selects
- Added more comments
- Move definitions of FT_KEYPART, KEY_OPTIMIZE_* into sql_select.h as they are now
used in opt_table_elimination.cc
sql/table.h
MWL#17: Table elimination
- More comments
- Add NESTED_JOIN::n_tables
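The "binding equality" inference described in the sql_select.cc notes above can be sketched with a query drawn from the patch's own testcases (t1/t2 here mirror the tables created in mysql-test/t/table_elim.test):

```sql
-- t2's whole primary key (pk1,pk2,pk3) becomes bound: pk1 directly by
-- t1.a, then pk2 via the binding equality pk2=pk1+1, then pk3 via
-- pk3=pk2+1. At most one t2 row can match each t1 row, and since no
-- t2 column is used outside the ON clause, t2 is eliminated entirely.
SELECT t1.a
FROM t1 LEFT JOIN t2
  ON t2.pk1 = t1.a
 AND t2.pk2 = t2.pk1 + 1
 AND t2.pk3 = t2.pk2 + 1;
```

By contrast, replacing the last condition with t2.pk2 = t2.pk3 + t2.col leaves the key only partially bound by key columns (t2.col is not a key part), so t2 must stay in the plan, as the "This must use both" testcase below verifies.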
=== modified file 'libmysqld/Makefile.am'
--- a/libmysqld/Makefile.am 2009-03-12 22:27:35 +0000
+++ b/libmysqld/Makefile.am 2009-06-30 15:09:36 +0000
@@ -76,7 +76,7 @@ sqlsources = derror.cc field.cc field_co
rpl_filter.cc sql_partition.cc sql_builtin.cc sql_plugin.cc \
sql_tablespace.cc \
rpl_injector.cc my_user.c partition_info.cc \
- sql_servers.cc event_parse_data.cc
+ sql_servers.cc event_parse_data.cc opt_table_elimination.cc
libmysqld_int_a_SOURCES= $(libmysqld_sources)
nodist_libmysqld_int_a_SOURCES= $(libmysqlsources) $(sqlsources)
=== modified file 'mysql-test/r/ps_11bugs.result'
--- a/mysql-test/r/ps_11bugs.result 2008-10-08 11:23:53 +0000
+++ b/mysql-test/r/ps_11bugs.result 2009-06-30 15:09:36 +0000
@@ -121,8 +121,8 @@ insert into t1 values (1);
explain select * from t1 where 3 in (select (1+1) union select 1);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
-2 DEPENDENT SUBQUERY NULL NULL NULL NULL NULL NULL NULL No tables used
-3 DEPENDENT UNION NULL NULL NULL NULL NULL NULL NULL No tables used
+2 DEPENDENT SUBQUERY NULL NULL NULL NULL NULL NULL NULL Impossible HAVING
+3 DEPENDENT UNION NULL NULL NULL NULL NULL NULL NULL Impossible HAVING
NULL UNION RESULT <union2,3> ALL NULL NULL NULL NULL NULL
select * from t1 where 3 in (select (1+1) union select 1);
a
=== modified file 'mysql-test/r/select.result'
--- a/mysql-test/r/select.result 2009-03-16 05:02:10 +0000
+++ b/mysql-test/r/select.result 2009-06-30 15:09:36 +0000
@@ -3585,7 +3585,6 @@ INSERT INTO t2 VALUES (1,'a'),(2,'b'),(3
EXPLAIN SELECT t1.a FROM t1 LEFT JOIN t2 ON t2.b=t1.b WHERE t1.a=3;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 const PRIMARY PRIMARY 4 const 1
-1 SIMPLE t2 const b b 22 const 1 Using index
DROP TABLE t1,t2;
CREATE TABLE t1(id int PRIMARY KEY, b int, e int);
CREATE TABLE t2(i int, a int, INDEX si(i), INDEX ai(a));
=== modified file 'mysql-test/r/subselect.result'
--- a/mysql-test/r/subselect.result 2009-04-25 09:04:38 +0000
+++ b/mysql-test/r/subselect.result 2009-06-30 15:09:36 +0000
@@ -4353,13 +4353,13 @@ id select_type table type possible_keys
1 PRIMARY t1 ALL NULL NULL NULL NULL 2 100.00
2 DEPENDENT SUBQUERY t1 ALL NULL NULL NULL NULL 2 100.00 Using temporary; Using filesort
Warnings:
-Note 1003 select 1 AS `1` from `test`.`t1` where <in_optimizer>(1,<exists>(select 1 AS `1` from `test`.`t1` group by `test`.`t1`.`a` having (<cache>(1) = <ref_null_helper>(1))))
+Note 1003 select 1 AS `1` from `test`.`t1` where <in_optimizer>(1,<exists>(select 1 AS `1` from `test`.`t1` group by `test`.`t1`.`a` having 1))
EXPLAIN EXTENDED SELECT 1 FROM t1 WHERE 1 IN (SELECT 1 FROM t1 WHERE a > 3 GROUP BY a);
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
2 DEPENDENT SUBQUERY t1 ALL NULL NULL NULL NULL 2 100.00 Using where; Using temporary; Using filesort
Warnings:
-Note 1003 select 1 AS `1` from `test`.`t1` where <in_optimizer>(1,<exists>(select 1 AS `1` from `test`.`t1` where (`test`.`t1`.`a` > 3) group by `test`.`t1`.`a` having (<cache>(1) = <ref_null_helper>(1))))
+Note 1003 select 1 AS `1` from `test`.`t1` where <in_optimizer>(1,<exists>(select 1 AS `1` from `test`.`t1` where (`test`.`t1`.`a` > 3) group by `test`.`t1`.`a` having 1))
DROP TABLE t1;
End of 5.0 tests.
CREATE TABLE t1 (a INT, b INT);
=== added file 'mysql-test/r/table_elim.result'
--- a/mysql-test/r/table_elim.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/r/table_elim.result 2009-06-30 15:09:36 +0000
@@ -0,0 +1,204 @@
+drop table if exists t0, t1, t2, t3;
+drop view if exists v1, v2;
+create table t1 (a int);
+insert into t1 values (0),(1),(2),(3);
+create table t0 as select * from t1;
+create table t2 (a int primary key, b int)
+as select a, a as b from t1 where a in (1,2);
+create table t3 (a int primary key, b int)
+as select a, a as b from t1 where a in (1,3);
+# This will be eliminated:
+explain select t1.a from t1 left join t2 on t2.a=t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+explain extended select t1.a from t1 left join t2 on t2.a=t1.a;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4 100.00
+Warnings:
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` where 1
+select t1.a from t1 left join t2 on t2.a=t1.a;
+a
+0
+1
+2
+3
+# This will not be eliminated as t2.b is in in select list:
+explain select * from t1 left join t2 on t2.a=t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 test.t1.a 1
+# This will not be eliminated as t2.b is in in order list:
+explain select t1.a from t1 left join t2 on t2.a=t1.a order by t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4 Using temporary; Using filesort
+1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 test.t1.a 1
+# This will not be eliminated as t2.b is in group list:
+explain select t1.a from t1 left join t2 on t2.a=t1.a group by t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4 Using temporary; Using filesort
+1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 test.t1.a 1
+# This will not be eliminated as t2.b is in the WHERE
+explain select t1.a from t1 left join t2 on t2.a=t1.a where t2.b < 3 or t2.b is null;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 test.t1.a 1 Using where
+# Elimination of multiple tables:
+explain select t1.a from t1 left join (t2 join t3) on t2.a=t1.a and t3.a=t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+# Elimination of multiple tables (2):
+explain select t1.a from t1 left join (t2 join t3 on t2.b=t3.b) on t2.a=t1.a and t3.a=t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+# Elimination when done within an outer join nest:
+explain extended
+select t0.*
+from
+t0 left join (t1 left join (t2 join t3 on t2.b=t3.b) on t2.a=t1.a and
+t3.a=t1.a) on t0.a=t1.a;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 SIMPLE t0 ALL NULL NULL NULL NULL 4 100.00
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4 100.00
+Warnings:
+Note 1003 select `test`.`t0`.`a` AS `a` from `test`.`t0` left join (`test`.`t1`) on((`test`.`t0`.`a` = `test`.`t1`.`a`)) where 1
+# Elimination with aggregate functions
+explain select count(*) from t1 left join t2 on t2.a=t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+explain select count(1) from t1 left join t2 on t2.a=t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+explain select count(1) from t1 left join t2 on t2.a=t1.a group by t1.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4 Using temporary; Using filesort
+This must not use elimination:
+explain select count(1) from t1 left join t2 on t2.a=t1.a group by t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4 Using temporary; Using filesort
+1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 test.t1.a 1 Using index
+drop table t0, t1, t2, t3;
+create table t0 ( id integer, primary key (id));
+create table t1 (
+id integer,
+attr1 integer,
+primary key (id),
+key (attr1)
+);
+create table t2 (
+id integer,
+attr2 integer,
+fromdate date,
+primary key (id, fromdate),
+key (attr2,fromdate)
+);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+insert into t0 select A.id + 10*B.id from t0 A, t0 B where B.id > 0;
+insert into t1 select id, id from t0;
+insert into t2 select id, id, date_add('2009-06-22', interval id day) from t0;
+insert into t2 select id, id+1, date_add('2008-06-22', interval id day) from t0;
+create view v1 as
+select
+F.id, A1.attr1, A2.attr2
+from
+t0 F
+left join t1 A1 on A1.id=F.id
+left join t2 A2 on A2.id=F.id and
+A2.fromdate=(select MAX(fromdate) from
+t2 where id=A2.id);
+create view v2 as
+select
+F.id, A1.attr1, A2.attr2
+from
+t0 F
+left join t1 A1 on A1.id=F.id
+left join t2 A2 on A2.id=F.id and
+A2.fromdate=(select MAX(fromdate) from
+t2 where id=F.id);
+This should use one table:
+explain select id from v1 where id=2;
+id select_type table type possible_keys key key_len ref rows Extra
+1 PRIMARY F const PRIMARY PRIMARY 4 const 1 Using index
+This should use one table:
+explain extended select id from v1 where id in (1,2,3,4);
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY F range PRIMARY PRIMARY 4 NULL 4 100.00 Using where; Using index
+Warnings:
+Note 1276 Field or reference 'test.A2.id' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `F`.`id` AS `id` from `test`.`t0` `F` where (`F`.`id` in (1,2,3,4))
+This should use facts and A1 tables:
+explain extended select id from v1 where attr1 between 12 and 14;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY A1 range PRIMARY,attr1 attr1 5 NULL 2 100.00 Using where
+1 PRIMARY F eq_ref PRIMARY PRIMARY 4 test.A1.id 1 100.00 Using index
+Warnings:
+Note 1276 Field or reference 'test.A2.id' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `F`.`id` AS `id` from `test`.`t0` `F` join `test`.`t1` `A1` where ((`F`.`id` = `A1`.`id`) and (`A1`.`attr1` between 12 and 14))
+This should use facts, A2 and its subquery:
+explain extended select id from v1 where attr2 between 12 and 14;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY A2 range PRIMARY,attr2 attr2 5 NULL 5 100.00 Using where
+1 PRIMARY F eq_ref PRIMARY PRIMARY 4 test.A2.id 1 100.00 Using index
+3 DEPENDENT SUBQUERY t2 ref PRIMARY PRIMARY 4 test.A2.id 2 100.00 Using index
+Warnings:
+Note 1276 Field or reference 'test.A2.id' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `F`.`id` AS `id` from `test`.`t0` `F` join `test`.`t2` `A2` where ((`F`.`id` = `A2`.`id`) and (`A2`.`attr2` between 12 and 14) and (`A2`.`fromdate` = (select max(`test`.`t2`.`fromdate`) AS `MAX(fromdate)` from `test`.`t2` where (`test`.`t2`.`id` = `A2`.`id`))))
+This should use one table:
+explain select id from v2 where id=2;
+id select_type table type possible_keys key key_len ref rows Extra
+1 PRIMARY F const PRIMARY PRIMARY 4 const 1 Using index
+This should use one table:
+explain extended select id from v2 where id in (1,2,3,4);
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY F range PRIMARY PRIMARY 4 NULL 4 100.00 Using where; Using index
+Warnings:
+Note 1276 Field or reference 'test.F.id' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `F`.`id` AS `id` from `test`.`t0` `F` where (`F`.`id` in (1,2,3,4))
+This should use facts and A1 tables:
+explain extended select id from v2 where attr1 between 12 and 14;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY A1 range PRIMARY,attr1 attr1 5 NULL 2 100.00 Using where
+1 PRIMARY F eq_ref PRIMARY PRIMARY 4 test.A1.id 1 100.00 Using index
+Warnings:
+Note 1276 Field or reference 'test.F.id' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `F`.`id` AS `id` from `test`.`t0` `F` join `test`.`t1` `A1` where ((`F`.`id` = `A1`.`id`) and (`A1`.`attr1` between 12 and 14))
+This should use facts, A2 and its subquery:
+explain extended select id from v2 where attr2 between 12 and 14;
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY A2 range PRIMARY,attr2 attr2 5 NULL 5 100.00 Using where
+1 PRIMARY F eq_ref PRIMARY PRIMARY 4 test.A2.id 1 100.00 Using where; Using index
+3 DEPENDENT SUBQUERY t2 ref PRIMARY PRIMARY 4 test.F.id 2 100.00 Using index
+Warnings:
+Note 1276 Field or reference 'test.F.id' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `F`.`id` AS `id` from `test`.`t0` `F` join `test`.`t2` `A2` where ((`F`.`id` = `A2`.`id`) and (`A2`.`attr2` between 12 and 14) and (`A2`.`fromdate` = (select max(`test`.`t2`.`fromdate`) AS `MAX(fromdate)` from `test`.`t2` where (`test`.`t2`.`id` = `F`.`id`))))
+drop view v1, v2;
+drop table t0, t1, t2;
+create table t1 (a int);
+insert into t1 values (0),(1),(2),(3);
+create table t2 (pk1 int, pk2 int, pk3 int, col int, primary key(pk1, pk2, pk3));
+insert into t2 select a,a,a,a from t1;
+This must use only t1:
+explain select t1.* from t1 left join t2 on t2.pk1=t1.a and
+t2.pk2=t2.pk1+1 and
+t2.pk3=t2.pk2+1;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+This must use only t1:
+explain select t1.* from t1 left join t2 on t2.pk1=t1.a and
+t2.pk3=t2.pk1+1 and
+t2.pk2=t2.pk3+1;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+This must use both:
+explain select t1.* from t1 left join t2 on t2.pk1=t1.a and
+t2.pk3=t2.pk1+1 and
+t2.pk2=t2.pk3+t2.col;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+1 SIMPLE t2 ref PRIMARY PRIMARY 4 test.t1.a 1
+This must use only t1:
+explain select t1.* from t1 left join t2 on t2.pk2=t1.a and
+t2.pk1=t2.pk2+1 and
+t2.pk3=t2.pk1;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 ALL NULL NULL NULL NULL 4
+drop table t1, t2;
=== modified file 'mysql-test/r/union.result'
--- a/mysql-test/r/union.result 2009-03-19 10:18:52 +0000
+++ b/mysql-test/r/union.result 2009-06-30 15:09:36 +0000
@@ -522,7 +522,7 @@ id select_type table type possible_keys
2 UNION t2 const PRIMARY PRIMARY 4 const 1 100.00
NULL UNION RESULT <union1,2> ALL NULL NULL NULL NULL NULL NULL
Warnings:
-Note 1003 (select '1' AS `a`,'1' AS `b` from `test`.`t1` where ('1' = 1)) union (select '1' AS `a`,'10' AS `b` from `test`.`t2` where ('1' = 1))
+Note 1003 (select '1' AS `a`,'1' AS `b` from `test`.`t1` where 1) union (select '1' AS `a`,'10' AS `b` from `test`.`t2` where 1)
(select * from t1 where a=5) union (select * from t2 where a=1);
a b
1 10
=== added file 'mysql-test/t/table_elim.test'
--- a/mysql-test/t/table_elim.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/t/table_elim.test 2009-06-30 15:09:36 +0000
@@ -0,0 +1,160 @@
+#
+# Table elimination (MWL#17) tests
+#
+--disable_warnings
+drop table if exists t0, t1, t2, t3;
+drop view if exists v1, v2;
+--enable_warnings
+
+create table t1 (a int);
+insert into t1 values (0),(1),(2),(3);
+create table t0 as select * from t1;
+
+create table t2 (a int primary key, b int)
+ as select a, a as b from t1 where a in (1,2);
+
+create table t3 (a int primary key, b int)
+ as select a, a as b from t1 where a in (1,3);
+
+--echo # This will be eliminated:
+explain select t1.a from t1 left join t2 on t2.a=t1.a;
+explain extended select t1.a from t1 left join t2 on t2.a=t1.a;
+
+select t1.a from t1 left join t2 on t2.a=t1.a;
+
+--echo # This will not be eliminated as t2.b is in in select list:
+explain select * from t1 left join t2 on t2.a=t1.a;
+
+--echo # This will not be eliminated as t2.b is in in order list:
+explain select t1.a from t1 left join t2 on t2.a=t1.a order by t2.b;
+
+--echo # This will not be eliminated as t2.b is in group list:
+explain select t1.a from t1 left join t2 on t2.a=t1.a group by t2.b;
+
+--echo # This will not be eliminated as t2.b is in the WHERE
+explain select t1.a from t1 left join t2 on t2.a=t1.a where t2.b < 3 or t2.b is null;
+
+--echo # Elimination of multiple tables:
+explain select t1.a from t1 left join (t2 join t3) on t2.a=t1.a and t3.a=t1.a;
+
+--echo # Elimination of multiple tables (2):
+explain select t1.a from t1 left join (t2 join t3 on t2.b=t3.b) on t2.a=t1.a and t3.a=t1.a;
+
+--echo # Elimination when done within an outer join nest:
+explain extended
+select t0.*
+from
+ t0 left join (t1 left join (t2 join t3 on t2.b=t3.b) on t2.a=t1.a and
+ t3.a=t1.a) on t0.a=t1.a;
+
+--echo # Elimination with aggregate functions
+explain select count(*) from t1 left join t2 on t2.a=t1.a;
+explain select count(1) from t1 left join t2 on t2.a=t1.a;
+explain select count(1) from t1 left join t2 on t2.a=t1.a group by t1.a;
+
+--echo This must not use elimination:
+explain select count(1) from t1 left join t2 on t2.a=t1.a group by t2.a;
+
+drop table t0, t1, t2, t3;
+
+# This will stand for elim_facts
+create table t0 ( id integer, primary key (id));
+
+# Attribute1, non-versioned
+create table t1 (
+ id integer,
+ attr1 integer,
+ primary key (id),
+ key (attr1)
+);
+
+# Attribute2, time-versioned
+create table t2 (
+ id integer,
+ attr2 integer,
+ fromdate date,
+ primary key (id, fromdate),
+ key (attr2,fromdate)
+);
+
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+insert into t0 select A.id + 10*B.id from t0 A, t0 B where B.id > 0;
+
+insert into t1 select id, id from t0;
+insert into t2 select id, id, date_add('2009-06-22', interval id day) from t0;
+insert into t2 select id, id+1, date_add('2008-06-22', interval id day) from t0;
+
+create view v1 as
+select
+ F.id, A1.attr1, A2.attr2
+from
+ t0 F
+ left join t1 A1 on A1.id=F.id
+ left join t2 A2 on A2.id=F.id and
+ A2.fromdate=(select MAX(fromdate) from
+ t2 where id=A2.id);
+create view v2 as
+select
+ F.id, A1.attr1, A2.attr2
+from
+ t0 F
+ left join t1 A1 on A1.id=F.id
+ left join t2 A2 on A2.id=F.id and
+ A2.fromdate=(select MAX(fromdate) from
+ t2 where id=F.id);
+
+--echo This should use one table:
+explain select id from v1 where id=2;
+--echo This should use one table:
+explain extended select id from v1 where id in (1,2,3,4);
+--echo This should use facts and A1 tables:
+explain extended select id from v1 where attr1 between 12 and 14;
+--echo This should use facts, A2 and its subquery:
+explain extended select id from v1 where attr2 between 12 and 14;
+
+# Repeat for v2:
+
+--echo This should use one table:
+explain select id from v2 where id=2;
+--echo This should use one table:
+explain extended select id from v2 where id in (1,2,3,4);
+--echo This should use facts and A1 tables:
+explain extended select id from v2 where attr1 between 12 and 14;
+--echo This should use facts, A2 and its subquery:
+explain extended select id from v2 where attr2 between 12 and 14;
+
+drop view v1, v2;
+drop table t0, t1, t2;
+
+#
+# Tests for the code that uses t.keypartX=func(t.keypartY) equalities to
+# make table elimination inferences
+#
+create table t1 (a int);
+insert into t1 values (0),(1),(2),(3);
+
+create table t2 (pk1 int, pk2 int, pk3 int, col int, primary key(pk1, pk2, pk3));
+insert into t2 select a,a,a,a from t1;
+
+--echo This must use only t1:
+explain select t1.* from t1 left join t2 on t2.pk1=t1.a and
+ t2.pk2=t2.pk1+1 and
+ t2.pk3=t2.pk2+1;
+
+--echo This must use only t1:
+explain select t1.* from t1 left join t2 on t2.pk1=t1.a and
+ t2.pk3=t2.pk1+1 and
+ t2.pk2=t2.pk3+1;
+
+--echo This must use both:
+explain select t1.* from t1 left join t2 on t2.pk1=t1.a and
+ t2.pk3=t2.pk1+1 and
+ t2.pk2=t2.pk3+t2.col;
+
+--echo This must use only t1:
+explain select t1.* from t1 left join t2 on t2.pk2=t1.a and
+ t2.pk1=t2.pk2+1 and
+ t2.pk3=t2.pk1;
+
+drop table t1, t2;
+
=== added file 'sql-bench/test-table-elimination.sh'
--- a/sql-bench/test-table-elimination.sh 1970-01-01 00:00:00 +0000
+++ b/sql-bench/test-table-elimination.sh 2009-06-30 15:09:36 +0000
@@ -0,0 +1,320 @@
+#!@PERL@
+# Test of table elimination feature
+
+use Cwd;
+use DBI;
+use Getopt::Long;
+use Benchmark;
+
+$opt_loop_count=100000;
+$opt_medium_loop_count=10000;
+$opt_small_loop_count=100;
+
+$pwd = cwd(); $pwd = "." if ($pwd eq '');
+require "$pwd/bench-init.pl" || die "Can't read Configuration file: $!\n";
+
+if ($opt_small_test)
+{
+ $opt_loop_count/=10;
+ $opt_medium_loop_count/=10;
+ $opt_small_loop_count/=10;
+}
+
+print "Testing table elimination feature\n";
+print "The test table has $opt_loop_count rows.\n\n";
+
+# A query to get the recent versions of all attributes:
+$select_current_full_facts="
+ select
+ F.id, A1.attr1, A2.attr2
+ from
+ elim_facts F
+ left join elim_attr1 A1 on A1.id=F.id
+ left join elim_attr2 A2 on A2.id=F.id and
+ A2.fromdate=(select MAX(fromdate) from
+ elim_attr2 where id=A2.id);
+";
+$select_current_full_facts="
+ select
+ F.id, A1.attr1, A2.attr2
+ from
+ elim_facts F
+ left join elim_attr1 A1 on A1.id=F.id
+ left join elim_attr2 A2 on A2.id=F.id and
+ A2.fromdate=(select MAX(fromdate) from
+ elim_attr2 where id=F.id);
+";
+# TODO: same as above but for some given date also?
+# TODO:
+
+
+####
+#### Connect and start timeing
+####
+
+$dbh = $server->connect();
+$start_time=new Benchmark;
+
+####
+#### Create needed tables
+####
+
+goto select_test if ($opt_skip_create);
+
+print "Creating tables\n";
+$dbh->do("drop table elim_facts" . $server->{'drop_attr'});
+$dbh->do("drop table elim_attr1" . $server->{'drop_attr'});
+$dbh->do("drop table elim_attr2" . $server->{'drop_attr'});
+
+# The facts table
+do_many($dbh,$server->create("elim_facts",
+ ["id integer"],
+ ["primary key (id)"]));
+
+# Attribute1, non-versioned
+do_many($dbh,$server->create("elim_attr1",
+ ["id integer",
+ "attr1 integer"],
+ ["primary key (id)",
+ "key (attr1)"]));
+
+# Attribute2, time-versioned
+do_many($dbh,$server->create("elim_attr2",
+ ["id integer",
+ "attr2 integer",
+ "fromdate date"],
+ ["primary key (id, fromdate)",
+ "key (attr2,fromdate)"]));
+
+#NOTE: ignoring: if ($limits->{'views'})
+$dbh->do("drop view elim_current_facts");
+$dbh->do("create view elim_current_facts as $select_current_full_facts");
+
+if ($opt_lock_tables)
+{
+ do_query($dbh,"LOCK TABLES elim_facts, elim_attr1, elim_attr2 WRITE");
+}
+
+if ($opt_fast && defined($server->{vacuum}))
+{
+ $server->vacuum(1,\$dbh);
+}
+
+####
+#### Fill the facts table
+####
+$n_facts= $opt_loop_count;
+
+if ($opt_fast && $server->{transactions})
+{
+ $dbh->{AutoCommit} = 0;
+}
+
+print "Inserting $n_facts rows into facts table\n";
+$loop_time=new Benchmark;
+
+$query="insert into elim_facts values (";
+for ($id=0; $id < $n_facts ; $id++)
+{
+ do_query($dbh,"$query $id)");
+}
+
+if ($opt_fast && $server->{transactions})
+{
+ $dbh->commit;
+ $dbh->{AutoCommit} = 1;
+}
+
+$end_time=new Benchmark;
+print "Time to insert ($n_facts): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n\n";
+
+####
+#### Fill attr1 table
+####
+if ($opt_fast && $server->{transactions})
+{
+ $dbh->{AutoCommit} = 0;
+}
+
+print "Inserting $n_facts rows into attr1 table\n";
+$loop_time=new Benchmark;
+
+$query="insert into elim_attr1 values (";
+for ($id=0; $id < $n_facts ; $id++)
+{
+ $attr1= ceil(rand($n_facts));
+ do_query($dbh,"$query $id, $attr1)");
+}
+
+if ($opt_fast && $server->{transactions})
+{
+ $dbh->commit;
+ $dbh->{AutoCommit} = 1;
+}
+
+$end_time=new Benchmark;
+print "Time to insert ($n_facts): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n\n";
+
+####
+#### Fill attr2 table
+####
+if ($opt_fast && $server->{transactions})
+{
+ $dbh->{AutoCommit} = 0;
+}
+
+print "Inserting $n_facts rows into attr2 table\n";
+$loop_time=new Benchmark;
+
+for ($id=0; $id < $n_facts ; $id++)
+{
+ # Two values for each $id - current one and obsolete one.
+ $attr1= ceil(rand($n_facts));
+ $query="insert into elim_attr2 values ($id, $attr1, now())";
+ do_query($dbh,$query);
+ $query="insert into elim_attr2 values ($id, $attr1, '2009-01-01')";
+ do_query($dbh,$query);
+}
+
+if ($opt_fast && $server->{transactions})
+{
+ $dbh->commit;
+ $dbh->{AutoCommit} = 1;
+}
+
+$end_time=new Benchmark;
+print "Time to insert ($n_facts): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n\n";
+
+####
+#### Finalize the database population
+####
+
+if ($opt_lock_tables)
+{
+ do_query($dbh,"UNLOCK TABLES");
+}
+
+if ($opt_fast && defined($server->{vacuum}))
+{
+ $server->vacuum(0,\$dbh,["elim_facts", "elim_attr1", "elim_attr2"]);
+}
+
+if ($opt_lock_tables)
+{
+ do_query($dbh,"LOCK TABLES elim_facts, elim_attr1, elim_attr2 WRITE");
+}
+
+####
+#### Do some selects on the table
+####
+
+select_test:
+
+#
+# The selects will be:
+# - N pk-lookups with all attributes
+# - pk-attribute-based lookup
+# - latest-attribute value based lookup.
+
+
+###
+### Bare facts select:
+###
+print "testing bare facts facts table\n";
+$loop_time=new Benchmark;
+$rows=0;
+for ($i=0 ; $i < $opt_medium_loop_count ; $i++)
+{
+ $val= ceil(rand($n_facts));
+ $rows+=fetch_all_rows($dbh,"select * from elim_facts where id=$val");
+}
+$count=$i;
+
+$end_time=new Benchmark;
+print "time for select_bare_facts ($count:$rows): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n";
+
+
+###
+### Full facts select, no elimination:
+###
+print "testing full facts facts table\n";
+$loop_time=new Benchmark;
+$rows=0;
+for ($i=0 ; $i < $opt_medium_loop_count ; $i++)
+{
+ $val= rand($n_facts);
+ $rows+=fetch_all_rows($dbh,"select * from elim_current_facts where id=$val");
+}
+$count=$i;
+
+$end_time=new Benchmark;
+print "time for select_two_attributes ($count:$rows): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n";
+
+###
+### Now with elimination: select only only one fact
+###
+print "testing selection of one attribute\n";
+$loop_time=new Benchmark;
+$rows=0;
+for ($i=0 ; $i < $opt_medium_loop_count ; $i++)
+{
+ $val= rand($n_facts);
+ $rows+=fetch_all_rows($dbh,"select id, attr1 from elim_current_facts where id=$val");
+}
+$count=$i;
+
+$end_time=new Benchmark;
+print "time for select_one_attribute ($count:$rows): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n";
+
+###
+### Now with elimination: select only only one fact
+###
+print "testing selection of one attribute\n";
+$loop_time=new Benchmark;
+$rows=0;
+for ($i=0 ; $i < $opt_medium_loop_count ; $i++)
+{
+ $val= rand($n_facts);
+ $rows+=fetch_all_rows($dbh,"select id, attr2 from elim_current_facts where id=$val");
+}
+$count=$i;
+
+$end_time=new Benchmark;
+print "time for select_one_attribute ($count:$rows): " .
+ timestr(timediff($end_time, $loop_time),"all") . "\n";
+
+
+###
+### TODO...
+###
+
+;
+
+####
+#### End of benchmark
+####
+
+if ($opt_lock_tables)
+{
+ do_query($dbh,"UNLOCK TABLES");
+}
+if (!$opt_skip_delete)
+{
+ do_query($dbh,"drop table elim_facts, elim_attr1, elim_attr2" . $server->{'drop_attr'});
+}
+
+if ($opt_fast && defined($server->{vacuum}))
+{
+ $server->vacuum(0,\$dbh);
+}
+
+$dbh->disconnect; # close connection
+
+end_benchmark($start_time);
+
=== modified file 'sql/CMakeLists.txt'
--- a/sql/CMakeLists.txt 2008-11-21 14:21:50 +0000
+++ b/sql/CMakeLists.txt 2009-06-30 15:09:36 +0000
@@ -73,7 +73,7 @@ ADD_EXECUTABLE(mysqld
partition_info.cc rpl_utility.cc rpl_injector.cc sql_locale.cc
rpl_rli.cc rpl_mi.cc sql_servers.cc
sql_connect.cc scheduler.cc
- sql_profile.cc event_parse_data.cc
+ sql_profile.cc event_parse_data.cc opt_table_elimination.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.cc
${PROJECT_SOURCE_DIR}/sql/sql_yacc.h
${PROJECT_SOURCE_DIR}/include/mysqld_error.h
=== modified file 'sql/Makefile.am'
--- a/sql/Makefile.am 2009-03-12 22:27:35 +0000
+++ b/sql/Makefile.am 2009-06-30 15:09:36 +0000
@@ -121,7 +121,8 @@ mysqld_SOURCES = sql_lex.cc sql_handler.
event_queue.cc event_db_repository.cc events.cc \
sql_plugin.cc sql_binlog.cc \
sql_builtin.cc sql_tablespace.cc partition_info.cc \
- sql_servers.cc event_parse_data.cc
+ sql_servers.cc event_parse_data.cc \
+ opt_table_elimination.cc
nodist_mysqld_SOURCES = mini_client_errors.c pack.c client.c my_time.c my_user.c
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2009-04-25 10:05:32 +0000
+++ b/sql/item.cc 2009-06-30 15:09:36 +0000
@@ -1915,6 +1915,37 @@ void Item_field::reset_field(Field *f)
name= (char*) f->field_name;
}
+
+bool Item_field::check_column_usage_processor(uchar *arg)
+{
+ Field_processor_info* info=(Field_processor_info*)arg;
+
+ if (field->table == info->table)
+ {
+ /* It is not ok to use columns that are not part of the key of interest: */
+ if (!(field->part_of_key.is_set(info->keyno)))
+ return TRUE;
+
+ /* Find which key part we're using and mark it in needed_key_parts */
+ KEY *key= &field->table->key_info[info->keyno];
+ for (uint part= 0; part < key->key_parts; part++)
+ {
+ if (field->field_index == key->key_part[part].field->field_index)
+ {
+ if (part == info->forbidden_part)
+ return TRUE;
+ info->needed_key_parts |= key_part_map(1) << part;
+ break;
+ }
+ }
+ return FALSE;
+ }
+ else
+ info->used_tables |= this->used_tables();
+ return FALSE;
+}
+
+
const char *Item_ident::full_name() const
{
char *tmp;
@@ -3380,7 +3411,7 @@ static void mark_as_dependent(THD *thd,
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
- current->mark_as_dependent(last);
+ current->mark_as_dependent(last, resolved_item);
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
char warn_buff[MYSQL_ERRMSG_SIZE];
=== modified file 'sql/item.h'
--- a/sql/item.h 2009-04-25 10:05:32 +0000
+++ b/sql/item.h 2009-06-30 15:09:36 +0000
@@ -731,7 +731,11 @@ public:
virtual bool val_bool_result() { return val_bool(); }
virtual bool is_null_result() { return is_null(); }
- /* bit map of tables used by item */
+ /*
+ Bitmap of tables used by item
+ (note: if you need to check dependencies on individual columns, check out
+ check_column_usage_processor)
+ */
virtual table_map used_tables() const { return (table_map) 0L; }
/*
Return table map of tables that can't be NULL tables (tables that are
@@ -888,6 +892,8 @@ public:
virtual bool reset_query_id_processor(uchar *query_id_arg) { return 0; }
virtual bool is_expensive_processor(uchar *arg) { return 0; }
virtual bool register_field_in_read_map(uchar *arg) { return 0; }
+ virtual bool check_column_usage_processor(uchar *arg) { return 0; }
+ virtual bool mark_as_eliminated_processor(uchar *arg) { return 0; }
/*
Check if a partition function is allowed
SYNOPSIS
@@ -1011,6 +1017,18 @@ public:
bool eq_by_collation(Item *item, bool binary_cmp, CHARSET_INFO *cs);
};
+/* Data for Item::check_column_usage_processor */
+typedef struct
+{
+ TABLE *table; /* Table of interest */
+ uint keyno; /* Index of interest */
+  uint forbidden_part; /* Key part that one is not allowed to refer to */
+ /* [Set by processor] used tables, besides the table of interest */
+ table_map used_tables;
+ /* [Set by processor] Parts of index of interest that expression refers to */
+ uint needed_key_parts;
+} Field_processor_info;
+
class sp_head;
@@ -1477,6 +1495,7 @@ public:
bool find_item_in_field_list_processor(uchar *arg);
bool register_field_in_read_map(uchar *arg);
bool check_partition_func_processor(uchar *int_arg) {return FALSE;}
+ bool check_column_usage_processor(uchar *arg);
void cleanup();
bool result_as_longlong()
{
@@ -2203,6 +2222,10 @@ public:
if (!depended_from)
(*ref)->update_used_tables();
}
+ bool const_item() const
+ {
+ return (*ref)->const_item();
+ }
table_map not_null_tables() const { return (*ref)->not_null_tables(); }
void set_result_field(Field *field) { result_field= field; }
bool is_result_field() { return 1; }
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2009-01-31 21:22:44 +0000
+++ b/sql/item_subselect.cc 2009-06-30 15:09:36 +0000
@@ -39,7 +39,7 @@ inline Item * and_items(Item* cond, Item
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1), engine_changed(0), changed(0), is_correlated(FALSE)
+ const_item_cache(1), in_fix_fields(0), engine_changed(0), changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -151,10 +151,14 @@ bool Item_subselect::fix_fields(THD *thd
DBUG_ASSERT(fixed == 0);
engine->set_thd((thd= thd_param));
+ if (!in_fix_fields)
+ refers_to.empty();
+ eliminated= FALSE;
if (check_stack_overrun(thd, STACK_MIN_SIZE, (uchar*)&res))
return TRUE;
-
+
+ in_fix_fields++;
res= engine->prepare();
// all transformation is done (used by prepared statements)
@@ -181,12 +185,14 @@ bool Item_subselect::fix_fields(THD *thd
if (!(*ref)->fixed)
ret= (*ref)->fix_fields(thd, ref);
thd->where= save_where;
+ in_fix_fields--;
return ret;
}
// Is it one field subselect?
if (engine->cols() > max_columns)
{
my_error(ER_OPERAND_COLUMNS, MYF(0), 1);
+ in_fix_fields--;
return TRUE;
}
fix_length_and_dec();
@@ -203,11 +209,30 @@ bool Item_subselect::fix_fields(THD *thd
fixed= 1;
err:
+ in_fix_fields--;
thd->where= save_where;
return res;
}
+bool Item_subselect::check_column_usage_processor(uchar *arg)
+{
+ List_iterator<Item> it(refers_to);
+ Item *item;
+ while ((item= it++))
+ {
+ if (item->walk(&Item::check_column_usage_processor,FALSE, arg))
+ return TRUE;
+ }
+ return FALSE;
+}
+
+bool Item_subselect::mark_as_eliminated_processor(uchar *arg)
+{
+ eliminated= TRUE;
+ return FALSE;
+}
+
bool Item_subselect::walk(Item_processor processor, bool walk_subquery,
uchar *argument)
{
@@ -225,6 +250,7 @@ bool Item_subselect::walk(Item_processor
if (lex->having && (lex->having)->walk(processor, walk_subquery,
argument))
return 1;
+ /* TODO: why does this walk WHERE/HAVING but not ON expressions of outer joins? */
while ((item=li++))
{
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2008-02-22 10:30:33 +0000
+++ b/sql/item_subselect.h 2009-06-30 15:09:36 +0000
@@ -52,8 +52,16 @@ protected:
bool have_to_be_excluded;
/* cache of constant state */
bool const_item_cache;
-
+
public:
+ /*
+ References from inside the subquery to the select that this predicate is
+    in. References to parent selects are not included.
+ */
+ List<Item> refers_to;
+ int in_fix_fields;
+ bool eliminated;
+
/* changed engine indicator */
bool engine_changed;
/* subquery is transformed */
@@ -126,6 +134,8 @@ public:
virtual void reset_value_registration() {}
enum_parsing_place place() { return parsing_place; }
bool walk(Item_processor processor, bool walk_subquery, uchar *arg);
+ bool mark_as_eliminated_processor(uchar *arg);
+ bool check_column_usage_processor(uchar *arg);
/**
Get the SELECT_LEX structure associated with this Item.
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2009-04-25 09:04:38 +0000
+++ b/sql/item_sum.cc 2009-06-30 15:09:36 +0000
@@ -350,7 +350,7 @@ bool Item_sum::register_sum_func(THD *th
sl= sl->master_unit()->outer_select() )
sl->master_unit()->item->with_sum_func= 1;
}
- thd->lex->current_select->mark_as_dependent(aggr_sel);
+ thd->lex->current_select->mark_as_dependent(aggr_sel, NULL);
return FALSE;
}
@@ -542,11 +542,6 @@ void Item_sum::update_used_tables ()
args[i]->update_used_tables();
used_tables_cache|= args[i]->used_tables();
}
-
- used_tables_cache&= PSEUDO_TABLE_BITS;
-
- /* the aggregate function is aggregated into its local context */
- used_tables_cache |= (1 << aggr_sel->join->tables) - 1;
}
}
=== modified file 'sql/item_sum.h'
--- a/sql/item_sum.h 2008-12-09 19:43:10 +0000
+++ b/sql/item_sum.h 2009-06-30 15:09:36 +0000
@@ -255,6 +255,12 @@ protected:
*/
Item **orig_args, *tmp_orig_args[2];
table_map used_tables_cache;
+
+ /*
+ TRUE <=> We've managed to calculate the value of this Item in
+ opt_sum_query(), hence it can be considered constant at all subsequent
+ steps.
+ */
bool forced_const;
public:
@@ -341,6 +347,15 @@ public:
virtual const char *func_name() const= 0;
virtual Item *result_item(Field *field)
{ return new Item_field(field); }
+ /*
+ Return bitmap of tables that are needed to evaluate the item.
+
+ The implementation takes into account the used strategy: items resolved
+ at optimization phase will report 0.
+ Items that depend on the number of join output records, but not columns
+ of any particular table (like COUNT(*)) will report 0 from used_tables(),
+ but will still return false from const_item().
+ */
table_map used_tables() const { return used_tables_cache; }
void update_used_tables ();
void cleanup()
=== added file 'sql/opt_table_elimination.cc'
--- a/sql/opt_table_elimination.cc 1970-01-01 00:00:00 +0000
+++ b/sql/opt_table_elimination.cc 2009-06-30 15:09:36 +0000
@@ -0,0 +1,494 @@
+/**
+ @file
+
+ @brief
+ Table Elimination Module
+
+ @defgroup Table_Elimination Table Elimination Module
+ @{
+*/
+
+#ifdef USE_PRAGMA_IMPLEMENTATION
+#pragma implementation // gcc: Class implementation
+#endif
+
+#include "mysql_priv.h"
+#include "sql_select.h"
+
+/*
+ OVERVIEW
+
+  The module has one entry point - the eliminate_tables() function, which
+  must be called once, after update_ref_and_keys() but before the join
+  optimization.
+ eliminate_tables() operates over the JOIN structures. Logically, it
+ removes the right sides of outer join nests. Physically, it changes the
+ following members:
+
+ * Eliminated tables are marked as constant and moved to the front of the
+ join order.
+ * In addition to this, they are recorded in JOIN::eliminated_tables bitmap.
+
+ * All join nests have their NESTED_JOIN::n_tables updated to discount
+ the eliminated tables
+
+ * Items that became disused because they were in the ON expression of an
+ eliminated outer join are notified by means of the Item tree walk which
+ calls Item::mark_as_eliminated_processor for every item
+ - At the moment the only Item that cares is Item_subselect with its
+ Item_subselect::eliminated flag which is used by EXPLAIN code to
+ check if the subquery should be shown in EXPLAIN.
+
+ Table elimination is redone on every PS re-execution.
+*/
+
+static void mark_as_eliminated(JOIN *join, TABLE_LIST *tbl);
+static bool table_has_one_match(TABLE *table, table_map bound_tables,
+ bool *multiple_matches);
+static uint
+eliminate_tables_for_list(JOIN *join, TABLE **leaves_arr,
+ List<TABLE_LIST> *join_list,
+ bool its_outer_join,
+ table_map tables_in_list,
+ table_map tables_used_elsewhere,
+ bool *multiple_matches);
+static bool
+extra_keyuses_bind_all_keyparts(table_map bound_tables, TABLE *table,
+ KEYUSE *key_start, KEYUSE *key_end,
+ uint n_keyuses, table_map bound_parts);
+
+/*
+ Perform table elimination
+
+ SYNOPSIS
+ eliminate_tables()
+      join  INOUT  Join to work on. JOIN::const_tables and
+                   JOIN::const_table_map are updated to account for
+                   the eliminated tables.
+
+ DESCRIPTION
+ This function is the entry point for table elimination.
+ The idea behind table elimination is that if we have an outer join:
+
+      SELECT * FROM t1 LEFT JOIN
+              (t2 JOIN t3) ON t2.primary_key=t1.col AND
+                              t3.primary_key=t2.col
+ such that
+
+    1. columns of the inner tables are not used anywhere outside the outer
+       join (not in the WHERE clause, not in GROUP/ORDER BY, not in the
+       select list, etc.), and
+ 2. inner side of the outer join is guaranteed to produce at most one
+ record combination for each record combination of outer tables.
+
+ then the inner side of the outer join can be removed from the query.
+ This is because it will always produce one matching record (either a
+ real match or a NULL-complemented record combination), and since there
+ are no references to columns of the inner tables anywhere, it doesn't
+ matter which record combination it was.
+
+    This function primarily handles checking #1. It collects a bitmap of
+ tables that are not used in select list/GROUP BY/ORDER BY/HAVING/etc and
+ thus can possibly be eliminated.
+
+ SIDE EFFECTS
+ See the OVERVIEW section at the top of this file.
+
+*/
+
+void eliminate_tables(JOIN *join)
+{
+ Item *item;
+ table_map used_tables;
+ DBUG_ENTER("eliminate_tables");
+
+ DBUG_ASSERT(join->eliminated_tables == 0);
+
+ /* If there are no outer joins, we have nothing to eliminate: */
+ if (!join->outer_join)
+ DBUG_VOID_RETURN;
+
+ /* Find the tables that are referred to from WHERE/HAVING */
+ used_tables= (join->conds? join->conds->used_tables() : 0) |
+ (join->having? join->having->used_tables() : 0);
+
+ /* Add tables referred to from the select list */
+ List_iterator<Item> it(join->fields_list);
+ while ((item= it++))
+ used_tables |= item->used_tables();
+
+ /* Add tables referred to from ORDER BY and GROUP BY lists */
+ ORDER *all_lists[]= { join->order, join->group_list};
+ for (int i=0; i < 2; i++)
+ {
+ for (ORDER *cur_list= all_lists[i]; cur_list; cur_list= cur_list->next)
+ used_tables |= (*(cur_list->item))->used_tables();
+ }
+
+ THD* thd= join->thd;
+ if (join->select_lex == &thd->lex->select_lex)
+ {
+ /* Multi-table UPDATE and DELETE: don't eliminate the tables we modify: */
+ used_tables |= thd->table_map_for_update;
+
+ /* Multi-table UPDATE: don't eliminate tables referred from SET statement */
+ if (thd->lex->sql_command == SQLCOM_UPDATE_MULTI)
+ {
+ List_iterator<Item> it2(thd->lex->value_list);
+ while ((item= it2++))
+ used_tables |= item->used_tables();
+ }
+ }
+
+ table_map all_tables= join->all_tables_map();
+ if (all_tables & ~used_tables)
+ {
+ /* There are some tables that we probably could eliminate. Try it. */
+ TABLE *leaves_array[MAX_TABLES];
+ bool multiple_matches= FALSE;
+ eliminate_tables_for_list(join, leaves_array, join->join_list, FALSE,
+ all_tables, used_tables, &multiple_matches);
+ }
+ DBUG_VOID_RETURN;
+}
+
+/*
+ Perform table elimination in a given join list
+
+ SYNOPSIS
+ eliminate_tables_for_list()
+ join The join
+ leaves_arr OUT Store here an array of leaf (base) tables that
+ are descendants of the join_list, and increment
+ the pointer to point right above the array.
+ join_list Join list to work on
+ its_outer_join TRUE <=> join_list is an inner side of an outer
+ join
+ FALSE <=> otherwise (this is top-level join list)
+ tables_in_list Bitmap of tables embedded in the join_list.
+ tables_used_elsewhere Bitmap of tables that are referred to from
+ somewhere outside of the join list (e.g.
+ select list, HAVING, etc).
+
+ DESCRIPTION
+ Perform table elimination for a join list.
+    Try eliminating child nests first.
+    The check of the "all tables in the join nest can produce only one
+    matching record combination" property is modeled after constant table
+    detection, and we reuse the information gathered in attempts to
+    eliminate child join nests.
+
+ RETURN
+ Number of children left after elimination. 0 means everything was
+ eliminated.
+*/
+static uint
+eliminate_tables_for_list(JOIN *join, TABLE **leaves_arr,
+ List<TABLE_LIST> *join_list,
+ bool its_outer_join,
+ table_map tables_in_list,
+ table_map tables_used_elsewhere,
+ bool *multiple_matches)
+{
+ TABLE_LIST *tbl;
+ List_iterator<TABLE_LIST> it(*join_list);
+ table_map tables_used_on_left= 0;
+ TABLE **cur_table= leaves_arr;
+ bool children_have_multiple_matches= FALSE;
+ uint remaining_children= 0;
+
+ while ((tbl= it++))
+ {
+ if (tbl->on_expr)
+ {
+ table_map outside_used_tables= tables_used_elsewhere |
+ tables_used_on_left;
+ bool multiple_matches= FALSE;
+ if (tbl->nested_join)
+ {
+ /* This is "... LEFT JOIN (join_nest) ON cond" */
+ uint n;
+ if (!(n= eliminate_tables_for_list(join, cur_table,
+ &tbl->nested_join->join_list, TRUE,
+ tbl->nested_join->used_tables,
+ outside_used_tables,
+ &multiple_matches)))
+ {
+ mark_as_eliminated(join, tbl);
+ }
+ else
+ remaining_children++;
+ tbl->nested_join->n_tables= n;
+ }
+ else
+ {
+ /* This is "... LEFT JOIN tbl ON cond" */
+ if (!(tbl->table->map & outside_used_tables) &&
+ table_has_one_match(tbl->table, join->all_tables_map(),
+ &multiple_matches))
+ {
+ mark_as_eliminated(join, tbl);
+ }
+ else
+ remaining_children++;
+ }
+ tables_used_on_left |= tbl->on_expr->used_tables();
+ children_have_multiple_matches= children_have_multiple_matches ||
+ multiple_matches;
+ }
+ else
+ {
+ DBUG_ASSERT(!tbl->nested_join);
+ remaining_children++;
+ }
+
+ if (tbl->table)
+ *(cur_table++)= tbl->table;
+ }
+
+ *multiple_matches |= children_have_multiple_matches;
+
+ /* Try eliminating the nest we're called for */
+ if (its_outer_join && !children_have_multiple_matches &&
+ !(tables_in_list & tables_used_elsewhere))
+ {
+ table_map bound_tables= join->const_table_map | (join->all_tables_map() &
+ ~tables_in_list);
+ table_map old_bound_tables;
+ TABLE **leaves_end= cur_table;
+ /*
+      Do the same as the constant table search: try to expand the set of
+      bound tables until it covers all tables in the join_list.
+ */
+ do
+ {
+ old_bound_tables= bound_tables;
+ for (cur_table= leaves_arr; cur_table != leaves_end; cur_table++)
+ {
+ if (!((*cur_table)->map & join->eliminated_tables) &&
+ table_has_one_match(*cur_table, bound_tables, multiple_matches))
+ {
+ bound_tables |= (*cur_table)->map;
+ }
+ }
+ } while (old_bound_tables != bound_tables);
+
+ if (!(tables_in_list & ~bound_tables))
+ {
+ /*
+        This join_list can be eliminated. Signal this to the caller by
+        returning zero remaining children.
+ */
+ remaining_children= 0;
+ }
+ }
+ return remaining_children;
+}
+
+
+/*
+ Check if the table will produce at most one matching record
+
+ SYNOPSIS
+ table_has_one_match()
+ table The [base] table being checked
+ bound_tables Tables that should be considered bound.
+      multiple_matches OUT  Set to TRUE when there is no way we could
+                            find a limitation that would give us the
+                            one-match property.
+
+ DESCRIPTION
+ Check if table will produce at most one matching record for each record
+ combination of tables in bound_tables bitmap.
+
+ The check is based on ref analysis data, KEYUSE structures. We're
+ handling two cases:
+
+ 1. Table has a UNIQUE KEY(uk_col_1, ... uk_col_N), and for each uk_col_i
+ there is a KEYUSE that represents a limitation in form
+
+ table.uk_col_i = func(bound_tables) (X)
+
+    2. Same as above, but we also handle limitations of the form
+
+       table.uk_col_i = func(bound_tables, uk_col_j1, ... uk_col_jN)  (XX)
+
+ where values of uk_col_jN are known to be bound because for them we
+ have an equality of form (X) or (XX).
+
+ RETURN
+ TRUE Yes, at most one match
+ FALSE No
+*/
+
+static bool table_has_one_match(TABLE *table, table_map bound_tables,
+ bool *multiple_matches)
+{
+ KEYUSE *keyuse= table->reginfo.join_tab->keyuse;
+ if (keyuse)
+ {
+ while (keyuse->table == table)
+ {
+ uint key= keyuse->key;
+ key_part_map bound_parts=0;
+ uint n_unusable=0;
+ bool ft_key= test(keyuse->keypart == FT_KEYPART);
+ KEY *keyinfo= table->key_info + key;
+ KEYUSE *key_start = keyuse;
+
+ do /* For each keypart and each way to read it */
+ {
+ if (keyuse->type == KEYUSE_USABLE)
+ {
+        if (!(keyuse->used_tables & ~bound_tables) &&
+ !(keyuse->optimize & KEY_OPTIMIZE_REF_OR_NULL))
+ {
+ bound_parts |= keyuse->keypart_map;
+ }
+ }
+ else
+ n_unusable++;
+ keyuse++;
+ } while (keyuse->table == table && keyuse->key == key);
+
+ if (ft_key || ((keyinfo->flags & (HA_NOSAME | HA_NULL_PART_KEY))
+ != HA_NOSAME))
+ {
+ continue;
+ }
+
+ if (bound_parts == PREV_BITS(key_part_map, keyinfo->key_parts) ||
+ extra_keyuses_bind_all_keyparts(bound_tables, table, key_start,
+ keyuse, n_unusable, bound_parts))
+ {
+ return TRUE;
+ }
+ }
+ }
+ return FALSE;
+}
+
+
+/*
+  Check if KEYUSE elements with unusable==TRUE bind all parts of the key
+
+ SYNOPSIS
+
+ extra_keyuses_bind_all_keyparts()
+ bound_tables Tables which can be considered constants
+ table Table we're examining
+ key_start Start of KEYUSE array with elements describing the key
+ of interest
+ key_end End of the array + 1
+ n_keyuses Number of elements in the array that have unusable==TRUE
+ bound_parts Key parts whose values are known to be bound.
+
+ DESCRIPTION
+    Check if unusable KEYUSE elements cause all parts of the key to be
+    bound. An unusable keyuse element makes a keypart bound when it
+    represents a limitation of the form:
+
+ keyXpartY=func(bound_columns, preceding_tables)
+
+ RETURN
+    TRUE   Yes, all parts of the key are bound
+ FALSE No
+*/
+
+static bool
+extra_keyuses_bind_all_keyparts(table_map bound_tables, TABLE *table,
+ KEYUSE *key_start, KEYUSE *key_end,
+ uint n_keyuses, table_map bound_parts)
+{
+ /*
+ We need
+ - some 'unusable' KEYUSE elements to work on
+ - some keyparts to be already bound to start inferences:
+ */
+ if (n_keyuses && bound_parts)
+ {
+ KEY *keyinfo= table->key_info + key_start->key;
+ bool bound_more_parts;
+ do
+ {
+ bound_more_parts= FALSE;
+ for (KEYUSE *k= key_start; k!=key_end; k++)
+ {
+ if (k->type == KEYUSE_UNKNOWN)
+ {
+ Field_processor_info fp= {table, k->key, k->keypart, 0, 0};
+ if (k->val->walk(&Item::check_column_usage_processor, FALSE,
+ (uchar*)&fp))
+ k->type= KEYUSE_NO_BIND;
+ else
+ {
+ k->used_tables= fp.used_tables;
+ k->keypart_map= fp.needed_key_parts;
+ k->type= KEYUSE_BIND;
+ }
+ }
+
+ if (k->type == KEYUSE_BIND)
+ {
+ /*
+ If this is a binding keyuse, such that
+ - all tables it refers to are bound,
+ - all parts it refers to are bound
+ - but the key part it binds is not itself bound
+ */
+ if (!(k->used_tables & ~bound_tables) &&
+ !(k->keypart_map & ~bound_parts) &&
+ !(bound_parts & key_part_map(1) << k->keypart))
+ {
+ bound_parts|= key_part_map(1) << k->keypart;
+ if (bound_parts == PREV_BITS(key_part_map, keyinfo->key_parts))
+ return TRUE;
+ bound_more_parts= TRUE;
+ }
+ }
+ }
+ } while (bound_more_parts);
+ }
+ return FALSE;
+}
+
+
+/*
+ Mark one table or the whole join nest as eliminated.
+*/
+static void mark_as_eliminated(JOIN *join, TABLE_LIST *tbl)
+{
+ TABLE *table;
+ /*
+    NOTE: there are TABLE_LIST objects that have
+ tbl->table!= NULL && tbl->nested_join!=NULL and
+ tbl->table == tbl->nested_join->join_list->element(..)->table
+ */
+ if (tbl->nested_join)
+ {
+ TABLE_LIST *child;
+ List_iterator<TABLE_LIST> it(tbl->nested_join->join_list);
+ while ((child= it++))
+ mark_as_eliminated(join, child);
+ }
+ else if ((table= tbl->table))
+ {
+ JOIN_TAB *tab= tbl->table->reginfo.join_tab;
+ if (!(join->const_table_map & tab->table->map))
+ {
+ DBUG_PRINT("info", ("Eliminated table %s", table->alias));
+ tab->type= JT_CONST;
+ join->eliminated_tables |= table->map;
+ join->const_table_map|= table->map;
+ set_position(join, join->const_tables++, tab, (KEYUSE*)0);
+ }
+ }
+
+ if (tbl->on_expr)
+ tbl->on_expr->walk(&Item::mark_as_eliminated_processor, FALSE, NULL);
+}
+
+/**
+ @} (end of group Table_Elimination)
+*/
+
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2009-04-25 10:05:32 +0000
+++ b/sql/sql_lex.cc 2009-06-30 15:09:36 +0000
@@ -1778,7 +1778,7 @@ void st_select_lex_unit::exclude_tree()
'last' should be reachable from this st_select_lex_node
*/
-void st_select_lex::mark_as_dependent(st_select_lex *last)
+void st_select_lex::mark_as_dependent(st_select_lex *last, Item *dependency)
{
/*
Mark all selects from resolved to 1 before select where was
@@ -1804,6 +1804,8 @@ void st_select_lex::mark_as_dependent(st
}
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
+ if (dependency)
+ this->master_unit()->item->refers_to.push_back(dependency);
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2009-03-17 20:29:24 +0000
+++ b/sql/sql_lex.h 2009-06-30 15:09:36 +0000
@@ -743,7 +743,7 @@ public:
return master_unit()->return_after_parsing();
}
- void mark_as_dependent(st_select_lex *last);
+ void mark_as_dependent(st_select_lex *last, Item *dependency);
bool set_braces(bool value);
bool inc_in_sum_expr();
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-05-19 09:28:05 +0000
+++ b/sql/sql_select.cc 2009-06-30 15:09:36 +0000
@@ -60,7 +60,6 @@ static bool update_ref_and_keys(THD *thd
table_map table_map, SELECT_LEX *select_lex,
st_sargable_param **sargables);
static int sort_keyuse(KEYUSE *a,KEYUSE *b);
-static void set_position(JOIN *join,uint index,JOIN_TAB *table,KEYUSE *key);
static bool create_ref_for_key(JOIN *join, JOIN_TAB *j, KEYUSE *org_keyuse,
table_map used_tables);
static bool choose_plan(JOIN *join,table_map join_tables);
@@ -2381,6 +2380,13 @@ mysql_select(THD *thd, Item ***rref_poin
}
else
{
+ /*
+ When in EXPLAIN, delay deleting the joins so that they are still
+ available when we're producing EXPLAIN EXTENDED warning text.
+ */
+ if (select_options & SELECT_DESCRIBE)
+ free_join= 0;
+
if (!(join= new JOIN(thd, fields, select_options, result)))
DBUG_RETURN(TRUE);
thd_proc_info(thd, "init");
@@ -2468,6 +2474,7 @@ static ha_rows get_quick_record_count(TH
DBUG_RETURN(HA_POS_ERROR); /* This shouldn't happend */
}
+
/*
This structure is used to collect info on potentially sargable
predicates in order to check whether they become sargable after
@@ -2646,24 +2653,31 @@ make_join_statistics(JOIN *join, TABLE_L
~outer_join, join->select_lex, &sargables))
goto error;
- /* Read tables with 0 or 1 rows (system tables) */
join->const_table_map= 0;
+ join->const_tables= const_count;
+ eliminate_tables(join);
+ const_count= join->const_tables;
+ found_const_table_map= join->const_table_map;
+ /* Read tables with 0 or 1 rows (system tables) */
for (POSITION *p_pos=join->positions, *p_end=p_pos+const_count;
p_pos < p_end ;
p_pos++)
{
- int tmp;
s= p_pos->table;
- s->type=JT_SYSTEM;
- join->const_table_map|=s->table->map;
- if ((tmp=join_read_const_table(s, p_pos)))
+ if (! (s->table->map & join->eliminated_tables))
{
- if (tmp > 0)
- goto error; // Fatal error
+ int tmp;
+ s->type=JT_SYSTEM;
+ join->const_table_map|=s->table->map;
+ if ((tmp=join_read_const_table(s, p_pos)))
+ {
+ if (tmp > 0)
+ goto error; // Fatal error
+ }
+ else
+ found_const_table_map|= s->table->map;
}
- else
- found_const_table_map|= s->table->map;
}
/* loop until no more const tables are found */
@@ -2688,7 +2702,8 @@ make_join_statistics(JOIN *join, TABLE_L
substitution of a const table the key value happens to be null
then we can state that there are no matches for this equi-join.
*/
- if ((keyuse= s->keyuse) && *s->on_expr_ref && !s->embedding_map)
+ if ((keyuse= s->keyuse) && *s->on_expr_ref && !s->embedding_map &&
+ !(table->map & join->eliminated_tables))
{
/*
When performing an outer join operation if there are no matching rows
@@ -2747,14 +2762,16 @@ make_join_statistics(JOIN *join, TABLE_L
{
start_keyuse=keyuse;
key=keyuse->key;
- s->keys.set_bit(key); // QQ: remove this ?
+ if (keyuse->type == KEYUSE_USABLE)
+ s->keys.set_bit(key); // QQ: remove this ?
refs=0;
const_ref.clear_all();
eq_part.clear_all();
do
{
- if (keyuse->val->type() != Item::NULL_ITEM && !keyuse->optimize)
+ if (keyuse->type == KEYUSE_USABLE &&
+ keyuse->val->type() != Item::NULL_ITEM && !keyuse->optimize)
{
if (!((~found_const_table_map) & keyuse->used_tables))
const_ref.set_bit(keyuse->keypart);
@@ -2954,17 +2971,35 @@ typedef struct key_field_t {
*/
bool null_rejecting;
bool *cond_guard; /* See KEYUSE::cond_guard */
+ enum keyuse_type type; /* See KEYUSE::type */
} KEY_FIELD;
-/* Values in optimize */
-#define KEY_OPTIMIZE_EXISTS 1
-#define KEY_OPTIMIZE_REF_OR_NULL 2
/**
Merge new key definitions to old ones, remove those not used in both.
This is called for OR between different levels.
+ That is, the function operates on an array of KEY_FIELD elements which has
+ two parts:
+
+ $LEFT_PART $RIGHT_PART
+ +-----------------------+-----------------------+
+ start new_fields end
+
+ $LEFT_PART and $RIGHT_PART are arrays that have KEY_FIELD elements for two
+ parts of the OR condition. Our task is to produce an array of KEY_FIELD
+ elements that would correspond to "$LEFT_PART OR $RIGHT_PART".
+
+ The rules for combining elements are as follows:
+ (keyfieldA1 AND keyfieldA2 AND ...) OR (keyfieldB1 AND keyfieldB2 AND ...)=
+ AND_ij (keyfieldA_i OR keyfieldB_j)
+
+ We discard all (keyfieldA_i OR keyfieldB_j) that refer to different
+ fields. For those referring to the same field, the logic is as follows:
+
+ t.keycol=
+
To be able to do 'ref_or_null' we merge a comparison of a column
and 'column IS NULL' to one test. This is useful for sub select queries
that are internally transformed to something like:.
@@ -3029,13 +3064,18 @@ merge_key_fields(KEY_FIELD *start,KEY_FI
KEY_OPTIMIZE_REF_OR_NULL));
old->null_rejecting= (old->null_rejecting &&
new_fields->null_rejecting);
+ /*
+ The conditions are the same, hence their usabilities should
+ be, too (TODO: shouldn't that apply to the above
+ null_rejecting and optimize attributes?)
+ */
+ DBUG_ASSERT(old->type == new_fields->type);
}
}
else if (old->eq_func && new_fields->eq_func &&
old->val->eq_by_collation(new_fields->val,
old->field->binary(),
old->field->charset()))
-
{
old->level= and_level;
old->optimize= ((old->optimize & new_fields->optimize &
@@ -3044,10 +3084,15 @@ merge_key_fields(KEY_FIELD *start,KEY_FI
KEY_OPTIMIZE_REF_OR_NULL));
old->null_rejecting= (old->null_rejecting &&
new_fields->null_rejecting);
+ // "t.key_col=const" predicates are always usable
+ DBUG_ASSERT(old->type == KEYUSE_USABLE &&
+ new_fields->type == KEYUSE_USABLE);
}
else if (old->eq_func && new_fields->eq_func &&
- ((old->val->const_item() && old->val->is_null()) ||
- new_fields->val->is_null()))
+ ((new_fields->type == KEYUSE_USABLE &&
+ old->val->const_item() && old->val->is_null()) ||
+ ((old->type == KEYUSE_USABLE && new_fields->val->is_null()))))
+ /* TODO ^ why is the above asymmetric, why const_item()? */
{
/* field = expression OR field IS NULL */
old->level= and_level;
@@ -3118,6 +3163,7 @@ add_key_field(KEY_FIELD **key_fields,uin
table_map usable_tables, SARGABLE_PARAM **sargables)
{
uint exists_optimize= 0;
+ bool optimizable=0;
if (!(field->flags & PART_KEY_FLAG))
{
// Don't remove column IS NULL on a LEFT JOIN table
@@ -3130,15 +3176,12 @@ add_key_field(KEY_FIELD **key_fields,uin
else
{
table_map used_tables=0;
- bool optimizable=0;
for (uint i=0; i<num_values; i++)
{
used_tables|=(value[i])->used_tables();
if (!((value[i])->used_tables() & (field->table->map | RAND_TABLE_BIT)))
optimizable=1;
}
- if (!optimizable)
- return;
if (!(usable_tables & field->table->map))
{
if (!eq_func || (*value)->type() != Item::NULL_ITEM ||
@@ -3151,7 +3194,8 @@ add_key_field(KEY_FIELD **key_fields,uin
JOIN_TAB *stat=field->table->reginfo.join_tab;
key_map possible_keys=field->key_start;
possible_keys.intersect(field->table->keys_in_use_for_query);
- stat[0].keys.merge(possible_keys); // Add possible keys
+ if (optimizable)
+ stat[0].keys.merge(possible_keys); // Add possible keys
/*
Save the following cases:
@@ -3244,6 +3288,7 @@ add_key_field(KEY_FIELD **key_fields,uin
(*key_fields)->val= *value;
(*key_fields)->level= and_level;
(*key_fields)->optimize= exists_optimize;
+ (*key_fields)->type= optimizable? KEYUSE_USABLE : KEYUSE_UNKNOWN;
/*
If the condition has form "tbl.keypart = othertbl.field" and
othertbl.field can be NULL, there will be no matches if othertbl.field
@@ -3555,6 +3600,7 @@ add_key_part(DYNAMIC_ARRAY *keyuse_array
keyuse.optimize= key_field->optimize & KEY_OPTIMIZE_REF_OR_NULL;
keyuse.null_rejecting= key_field->null_rejecting;
keyuse.cond_guard= key_field->cond_guard;
+ keyuse.type= key_field->type;
VOID(insert_dynamic(keyuse_array,(uchar*) &keyuse));
}
}
@@ -3563,7 +3609,6 @@ add_key_part(DYNAMIC_ARRAY *keyuse_array
}
-#define FT_KEYPART (MAX_REF_PARTS+10)
static void
add_ft_keys(DYNAMIC_ARRAY *keyuse_array,
@@ -3622,6 +3667,7 @@ add_ft_keys(DYNAMIC_ARRAY *keyuse_array,
keyuse.used_tables=cond_func->key_item()->used_tables();
keyuse.optimize= 0;
keyuse.keypart_map= 0;
+ keyuse.type= KEYUSE_USABLE;
VOID(insert_dynamic(keyuse_array,(uchar*) &keyuse));
}
@@ -3636,6 +3682,13 @@ sort_keyuse(KEYUSE *a,KEYUSE *b)
return (int) (a->key - b->key);
if (a->keypart != b->keypart)
return (int) (a->keypart - b->keypart);
+
+  // Usable ones go before the unusable ones
+ int a_ok= test(a->type == KEYUSE_USABLE);
+ int b_ok= test(b->type == KEYUSE_USABLE);
+ if (a_ok != b_ok)
+ return a_ok? -1 : 1;
+
// Place const values before other ones
if ((res= test((a->used_tables & ~OUTER_REF_TABLE_BIT)) -
test((b->used_tables & ~OUTER_REF_TABLE_BIT))))
@@ -3846,7 +3899,8 @@ update_ref_and_keys(THD *thd, DYNAMIC_AR
found_eq_constant=0;
for (i=0 ; i < keyuse->elements-1 ; i++,use++)
{
- if (!use->used_tables && use->optimize != KEY_OPTIMIZE_REF_OR_NULL)
+ if (use->type == KEYUSE_USABLE && !use->used_tables &&
+ use->optimize != KEY_OPTIMIZE_REF_OR_NULL)
use->table->const_key_parts[use->key]|= use->keypart_map;
if (use->keypart != FT_KEYPART)
{
@@ -3870,7 +3924,8 @@ update_ref_and_keys(THD *thd, DYNAMIC_AR
/* Save ptr to first use */
if (!use->table->reginfo.join_tab->keyuse)
use->table->reginfo.join_tab->keyuse=save_pos;
- use->table->reginfo.join_tab->checked_keys.set_bit(use->key);
+ if (use->type == KEYUSE_USABLE)
+ use->table->reginfo.join_tab->checked_keys.set_bit(use->key);
save_pos++;
}
i=(uint) (save_pos-(KEYUSE*) keyuse->buffer);
@@ -3900,7 +3955,7 @@ static void optimize_keyuse(JOIN *join,
To avoid bad matches, we don't make ref_table_rows less than 100.
*/
keyuse->ref_table_rows= ~(ha_rows) 0; // If no ref
- if (keyuse->used_tables &
+ if (keyuse->type == KEYUSE_USABLE && keyuse->used_tables &
(map= (keyuse->used_tables & ~join->const_table_map &
~OUTER_REF_TABLE_BIT)))
{
@@ -3990,8 +4045,7 @@ add_group_and_distinct_keys(JOIN *join,
/** Save const tables first as used tables. */
-static void
-set_position(JOIN *join,uint idx,JOIN_TAB *table,KEYUSE *key)
+void set_position(JOIN *join,uint idx,JOIN_TAB *table,KEYUSE *key)
{
join->positions[idx].table= table;
join->positions[idx].key=key;
@@ -4093,7 +4147,8 @@ best_access_path(JOIN *join,
if 1. expression doesn't refer to forward tables
2. we won't get two ref-or-null's
*/
- if (!(remaining_tables & keyuse->used_tables) &&
+ if (keyuse->type == KEYUSE_USABLE &&
+ !(remaining_tables & keyuse->used_tables) &&
!(ref_or_null_part && (keyuse->optimize &
KEY_OPTIMIZE_REF_OR_NULL)))
{
@@ -5547,7 +5602,8 @@ static bool create_ref_for_key(JOIN *joi
*/
do
{
- if (!(~used_tables & keyuse->used_tables))
+ if (!(~used_tables & keyuse->used_tables) &&
+ keyuse->type == KEYUSE_USABLE)
{
if (keyparts == keyuse->keypart &&
!(found_part_ref_or_null & keyuse->optimize))
@@ -5597,9 +5653,11 @@ static bool create_ref_for_key(JOIN *joi
uint i;
for (i=0 ; i < keyparts ; keyuse++,i++)
{
- while (keyuse->keypart != i ||
- ((~used_tables) & keyuse->used_tables))
+ while (keyuse->keypart != i || ((~used_tables) & keyuse->used_tables) ||
+ !(keyuse->type == KEYUSE_USABLE))
+ {
keyuse++; /* Skip other parts */
+ }
uint maybe_null= test(keyinfo->key_part[i].null_bit);
j->ref.items[i]=keyuse->val; // Save for cond removal
@@ -5757,6 +5815,7 @@ JOIN::make_simple_join(JOIN *parent, TAB
tables= 1;
const_tables= 0;
const_table_map= 0;
+ eliminated_tables= 0;
tmp_table_param.field_count= tmp_table_param.sum_func_count=
tmp_table_param.func_count= 0;
tmp_table_param.copy_field= tmp_table_param.copy_field_end=0;
@@ -6021,7 +6080,7 @@ make_outerjoin_info(JOIN *join)
}
if (!tab->first_inner)
tab->first_inner= nested_join->first_nested;
- if (++nested_join->counter < nested_join->join_list.elements)
+ if (++nested_join->counter < nested_join->n_tables)
break;
/* Table tab is the last inner table for nested join. */
nested_join->first_nested->last_inner= tab;
@@ -8575,6 +8634,8 @@ simplify_joins(JOIN *join, List<TABLE_LI
conds= simplify_joins(join, &nested_join->join_list, conds, top);
used_tables= nested_join->used_tables;
not_null_tables= nested_join->not_null_tables;
+ /* The following two might become unequal after table elimination: */
+ nested_join->n_tables= nested_join->join_list.elements;
}
else
{
@@ -8733,7 +8794,7 @@ static uint build_bitmap_for_nested_join
with anything)
2. we could run out bits in nested_join_map otherwise.
*/
- if (nested_join->join_list.elements != 1)
+ if (nested_join->n_tables != 1)
{
nested_join->nj_map= (nested_join_map) 1 << first_unused++;
first_unused= build_bitmap_for_nested_joins(&nested_join->join_list,
@@ -8894,7 +8955,7 @@ static bool check_interleaving_with_nj(J
join->cur_embedding_map |= next_emb->nested_join->nj_map;
}
- if (next_emb->nested_join->join_list.elements !=
+ if (next_emb->nested_join->n_tables !=
next_emb->nested_join->counter)
break;
@@ -8926,9 +8987,23 @@ static void restore_prev_nj_state(JOIN_T
JOIN *join= last->join;
while (last_emb)
{
+ /*
+ psergey-elim: (nevermind)
+ new_prefix= cur_prefix & ~last;
+ if (!(new_prefix & cur_table_map)) // removed last inner table
+ {
+ join->cur_embedding_map&= ~last_emb->nested_join->nj_map;
+ }
+ else (current)
+ {
+ // Won't hurt doing it all the time:
+ join->cur_embedding_map |= ...;
+ }
+ else
+ */
if (!(--last_emb->nested_join->counter))
join->cur_embedding_map&= ~last_emb->nested_join->nj_map;
- else if (last_emb->nested_join->join_list.elements-1 ==
+ else if (last_emb->nested_join->n_tables-1 ==
last_emb->nested_join->counter)
join->cur_embedding_map|= last_emb->nested_join->nj_map;
else
@@ -16202,6 +16277,14 @@ static void select_describe(JOIN *join,
tmp3.length(0);
quick_type= -1;
+
+ /* Don't show eliminated tables */
+ if (table->map & join->eliminated_tables)
+ {
+ used_tables|=table->map;
+ continue;
+ }
+
item_list.empty();
/* id */
item_list.push_back(new Item_uint((uint32)
@@ -16524,8 +16607,11 @@ static void select_describe(JOIN *join,
unit;
unit= unit->next_unit())
{
- if (mysql_explain_union(thd, unit, result))
- DBUG_VOID_RETURN;
+ if (!(unit->item && unit->item->eliminated))
+ {
+ if (mysql_explain_union(thd, unit, result))
+ DBUG_VOID_RETURN;
+ }
}
DBUG_VOID_RETURN;
}
@@ -16566,7 +16652,6 @@ bool mysql_explain_union(THD *thd, SELEC
unit->fake_select_lex->options|= SELECT_DESCRIBE;
if (!(res= unit->prepare(thd, result, SELECT_NO_UNLOCK | SELECT_DESCRIBE)))
res= unit->exec();
- res|= unit->cleanup();
}
else
{
@@ -16599,6 +16684,7 @@ bool mysql_explain_union(THD *thd, SELEC
*/
static void print_join(THD *thd,
+ table_map eliminated_tables,
String *str,
List<TABLE_LIST> *tables,
enum_query_type query_type)
@@ -16614,12 +16700,33 @@ static void print_join(THD *thd,
*t= ti++;
DBUG_ASSERT(tables->elements >= 1);
- (*table)->print(thd, str, query_type);
+ /*
+ Assert that the first table in the list isn't eliminated. This comes from
+ the fact that the first table can't be inner table of an outer join.
+ */
+ DBUG_ASSERT(!eliminated_tables ||
+ !(((*table)->table && ((*table)->table->map & eliminated_tables)) ||
+ ((*table)->nested_join && !((*table)->nested_join->used_tables &
+ ~eliminated_tables))));
+ (*table)->print(thd, eliminated_tables, str, query_type);
TABLE_LIST **end= table + tables->elements;
for (TABLE_LIST **tbl= table + 1; tbl < end; tbl++)
{
TABLE_LIST *curr= *tbl;
+ /*
+ The "eliminated_tables &&" check guards against the case of
+ printing the query for CREATE VIEW. We do that without having run
+ JOIN::optimize() and so will have nested_join->used_tables==0.
+ */
+ if (eliminated_tables &&
+ ((curr->table && (curr->table->map & eliminated_tables)) ||
+ (curr->nested_join && !(curr->nested_join->used_tables &
+ ~eliminated_tables))))
+ {
+ continue;
+ }
+
if (curr->outer_join)
{
/* MySQL converts right to left joins */
@@ -16629,7 +16736,7 @@ static void print_join(THD *thd,
str->append(STRING_WITH_LEN(" straight_join "));
else
str->append(STRING_WITH_LEN(" join "));
- curr->print(thd, str, query_type);
+ curr->print(thd, eliminated_tables, str, query_type);
if (curr->on_expr)
{
str->append(STRING_WITH_LEN(" on("));
@@ -16683,12 +16790,13 @@ Index_hint::print(THD *thd, String *str)
@param str string where table should be printed
*/
-void TABLE_LIST::print(THD *thd, String *str, enum_query_type query_type)
+void TABLE_LIST::print(THD *thd, table_map eliminated_tables, String *str,
+ enum_query_type query_type)
{
if (nested_join)
{
str->append('(');
- print_join(thd, str, &nested_join->join_list, query_type);
+ print_join(thd, eliminated_tables, str, &nested_join->join_list, query_type);
str->append(')');
}
else
@@ -16830,7 +16938,7 @@ void st_select_lex::print(THD *thd, Stri
{
str->append(STRING_WITH_LEN(" from "));
/* go through join tree */
- print_join(thd, str, &top_join_list, query_type);
+ print_join(thd, join? join->eliminated_tables: 0, str, &top_join_list, query_type);
}
else if (where)
{
=== modified file 'sql/sql_select.h'
--- a/sql/sql_select.h 2009-04-25 10:05:32 +0000
+++ b/sql/sql_select.h 2009-06-30 15:09:36 +0000
@@ -28,6 +28,45 @@
#include "procedure.h"
#include <myisam.h>
+#define FT_KEYPART (MAX_REF_PARTS+10)
+/* Values in optimize */
+#define KEY_OPTIMIZE_EXISTS 1
+#define KEY_OPTIMIZE_REF_OR_NULL 2
+
+/* KEYUSE element types */
+enum keyuse_type
+{
+ /*
+ val refers to the same table; this is either a KEYUSE_BIND or a
+ KEYUSE_NO_BIND type, but we haven't determined which one yet.
+ */
+ KEYUSE_UNKNOWN= 0,
+ /*
+ 'regular' keyuse, i.e. it represents one of the following
+ * t.keyXpartY = func(constants, other-tables)
+ * t.keyXpartY IS NULL
+ * t.keyXpartY = func(constants, other-tables) OR t.keyXpartY IS NULL
+ and can be used to construct ref access
+ */
+ KEYUSE_USABLE,
+ /*
+ The keyuse represents a condition in form:
+
+ t.uniq_keyXpartY = func(other parts of uniq_keyX)
+
+ This can't be used to construct ref access on uniq_keyX, but we can use it
+ to determine
+ that the table will produce at most one match.
+ */
+ KEYUSE_BIND,
+ /*
+ Keyuse that's not usable for ref access and doesn't meet the criteria of
+ KEYUSE_BIND. Examples:
+ t.keyXpartY = func(t.keyXpartY)
+ t.keyXpartY = func(column of t that's not covered by keyX)
+ */
+ KEYUSE_NO_BIND
+};
+
typedef struct keyuse_t {
TABLE *table;
Item *val; /**< or value if no field */
@@ -51,6 +90,15 @@ typedef struct keyuse_t {
NULL - Otherwise (the source equality can't be turned off)
*/
bool *cond_guard;
+ /*
+ KEYUSE_USABLE <=> This keyuse can be used to construct key access.
+ Other values <=> Otherwise. Currently unusable KEYUSEs represent equalities
+ where one table column refers to another one, like this:
+ t.keyXpartA=func(t.keyXpartB)
+ This equality cannot be used for index access but is useful
+ for table elimination.
+ */
+ enum keyuse_type type;
} KEYUSE;
class store_key;
@@ -210,7 +258,7 @@ typedef struct st_join_table {
JOIN *join;
/** Bitmap of nested joins this table is part of */
nested_join_map embedding_map;
-
+
void cleanup();
inline bool is_using_loose_index_scan()
{
@@ -285,7 +333,15 @@ public:
fetching data from a cursor
*/
bool resume_nested_loop;
- table_map const_table_map,found_const_table_map;
+ table_map const_table_map;
+ /*
+ Constant tables for which we have found a row (as opposed to those for
+ which we didn't).
+ */
+ table_map found_const_table_map;
+
+ /* Tables removed by table elimination. Set to 0 before the elimination. */
+ table_map eliminated_tables;
/*
Bitmap of all inner tables from outer joins
*/
@@ -425,6 +481,7 @@ public:
table= 0;
tables= 0;
const_tables= 0;
+ eliminated_tables= 0;
join_list= 0;
sort_and_group= 0;
first_record= 0;
@@ -530,6 +587,10 @@ public:
return (unit == &thd->lex->unit && (unit->fake_select_lex == 0 ||
select_lex == unit->fake_select_lex));
}
+ inline table_map all_tables_map()
+ {
+ return (table_map(1) << tables) - 1;
+ }
private:
bool make_simple_join(JOIN *join, TABLE *tmp_table);
};
@@ -730,9 +791,12 @@ bool error_if_full_join(JOIN *join);
int report_error(TABLE *table, int error);
int safe_index_read(JOIN_TAB *tab);
COND *remove_eq_conds(THD *thd, COND *cond, Item::cond_result *cond_value);
+void set_position(JOIN *join,uint idx,JOIN_TAB *table,KEYUSE *key);
inline bool optimizer_flag(THD *thd, uint flag)
{
return (thd->variables.optimizer_switch & flag);
}
+void eliminate_tables(JOIN *join);
+
=== modified file 'sql/table.h'
--- a/sql/table.h 2009-02-19 09:01:25 +0000
+++ b/sql/table.h 2009-06-30 15:09:36 +0000
@@ -1366,7 +1366,8 @@ struct TABLE_LIST
return (derived || view || schema_table || (create && !table->db_stat) ||
!table);
}
- void print(THD *thd, String *str, enum_query_type query_type);
+ void print(THD *thd, table_map eliminated_tables, String *str,
+ enum_query_type query_type);
bool check_single_table(TABLE_LIST **table, table_map map,
TABLE_LIST *view);
bool set_insert_values(MEM_ROOT *mem_root);
@@ -1615,7 +1616,11 @@ public:
typedef struct st_nested_join
{
List<TABLE_LIST> join_list; /* list of elements in the nested join */
- table_map used_tables; /* bitmap of tables in the nested join */
+ /*
+ Bitmap of tables within this nested join (including those embedded within
+ its children), including tables removed by table elimination.
+ */
+ table_map used_tables;
table_map not_null_tables; /* tables that rejects nulls */
struct st_join_table *first_nested;/* the first nested table in the plan */
/*
@@ -1626,6 +1631,11 @@ typedef struct st_nested_join
Before each use the counters are zeroed by reset_nj_counters.
*/
uint counter;
+ /*
+ Number of elements in join_list that were not removed by table
+ elimination (or that contain tables which were not removed).
+ */
+ uint n_tables;
nested_join_map nj_map; /* Bit used to identify this nested join*/
} NESTED_JOIN;

[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (psergey:2724)
by Sergey Petrunia 30 Jun '09
#At lp:maria based on revid:psergey@askmonty.org-20090625200729-u11xpwwn5ebddx09
2724 Sergey Petrunia 2009-06-30
Testing commit email
modified:
sql/sql_select.cc
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-06-25 20:07:29 +0000
+++ b/sql/sql_select.cc 2009-06-30 15:02:15 +0000
@@ -24,6 +24,8 @@
@{
*/
+#error Testing commit mails
+
#ifdef USE_PRAGMA_IMPLEMENTATION
#pragma implementation // gcc: Class implementation
#endif