[Maria-developers] Rev 2836: Check of maria engine presence added. in file:///home/bell/maria/bzr/work-maria-5.2-lb607147/
by sanja@askmonty.org 04 Aug '10
At file:///home/bell/maria/bzr/work-maria-5.2-lb607147/
------------------------------------------------------------
revno: 2836
revision-id: sanja(a)askmonty.org-20100804094351-8yyx0m06vi4pr9fj
parent: sanja(a)askmonty.org-20100803094925-fpuj52qvdkkw5994
committer: sanja(a)askmonty.org
branch nick: work-maria-5.2-lb607147
timestamp: Wed 2010-08-04 12:43:51 +0300
message:
Check of maria engine presence added.
Comment fixed.
=== modified file 'mysql-test/suite/vcol/t/vcol_handler_maria.test'
--- a/mysql-test/suite/vcol/t/vcol_handler_maria.test 2010-08-03 09:49:25 +0000
+++ b/mysql-test/suite/vcol/t/vcol_handler_maria.test 2010-08-04 09:43:51 +0000
@@ -14,8 +14,10 @@
# Change: #
################################################################################
+--source include/have_maria.inc
+
#
-# NOTE: PLEASE DO NOT ADD NOT MYISAM SPECIFIC TESTCASES HERE !
+# NOTE: PLEASE DO NOT ADD NOT MARIA SPECIFIC TESTCASES HERE !
# TESTCASES WHICH MUST BE APPLIED TO ALL STORAGE ENGINES MUST BE ADDED IN
# THE SOURCED FILES ONLY.
#
[Maria-developers] Rev 2835: Fix for launchpad bug #612894 in file:///home/bell/maria/bzr/work-maria-5.2-lb607147/
by sanja@askmonty.org 03 Aug '10
At file:///home/bell/maria/bzr/work-maria-5.2-lb607147/
------------------------------------------------------------
revno: 2835
revision-id: sanja(a)askmonty.org-20100803094925-fpuj52qvdkkw5994
parent: igor(a)askmonty.org-20100728190938-esx94q58hw3v5jue
committer: sanja(a)askmonty.org
branch nick: work-maria-5.2-lb607147
timestamp: Tue 2010-08-03 12:49:25 +0300
message:
Fix for launchpad bug #612894
Support of virtual columns added to maria engine.
=== added file 'mysql-test/suite/vcol/r/vcol_handler_maria.result'
--- a/mysql-test/suite/vcol/r/vcol_handler_maria.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/vcol/r/vcol_handler_maria.result 2010-08-03 09:49:25 +0000
@@ -0,0 +1,76 @@
+SET @@session.storage_engine = 'maria';
+create table t1 (a int,
+b int as (-a),
+c int as (-a) persistent,
+d char(1),
+index (a),
+index (c));
+insert into t1 (a,d) values (4,'a'), (2,'b'), (1,'c'), (3,'d');
+select * from t1;
+a b c d
+4 -4 -4 a
+2 -2 -2 b
+1 -1 -1 c
+3 -3 -3 d
+# HANDLER tbl_name OPEN
+handler t1 open;
+# HANDLER tbl_name READ non-vcol_index_name > (value1,value2,...)
+handler t1 read a > (2);
+a b c d
+3 -3 -3 d
+# HANDLER tbl_name READ non-vcol_index_name > (value1,value2,...) WHERE non-vcol_field=expr
+handler t1 read a > (2) where d='c';
+a b c d
+# HANDLER tbl_name READ vcol_index_name = (value1,value2,...)
+handler t1 read c = (-2);
+a b c d
+2 -2 -2 b
+# HANDLER tbl_name READ vcol_index_name = (value1,value2,...) WHERE non-vcol_field=expr
+handler t1 read c = (-2) where d='c';
+a b c d
+# HANDLER tbl_name READ non-vcol_index_name > (value1,value2,...) WHERE vcol_field=expr
+handler t1 read a > (2) where b=-3 && c=-3;
+a b c d
+3 -3 -3 d
+# HANDLER tbl_name READ vcol_index_name <= (value1,value2,...)
+handler t1 read c <= (-2);
+a b c d
+2 -2 -2 b
+# HANDLER tbl_name READ vcol_index_name > (value1,value2,...) WHERE vcol_field=expr
+handler t1 read c <= (-2) where b=-3;
+a b c d
+3 -3 -3 d
+# HANDLER tbl_name READ vcol_index_name FIRST
+handler t1 read c first;
+a b c d
+4 -4 -4 a
+# HANDLER tbl_name READ vcol_index_name NEXT
+handler t1 read c next;
+a b c d
+3 -3 -3 d
+# HANDLER tbl_name READ vcol_index_name PREV
+handler t1 read c prev;
+a b c d
+4 -4 -4 a
+# HANDLER tbl_name READ vcol_index_name LAST
+handler t1 read c last;
+a b c d
+1 -1 -1 c
+# HANDLER tbl_name READ FIRST where non-vcol=expr
+handler t1 read FIRST where a >= 2;
+a b c d
+4 -4 -4 a
+# HANDLER tbl_name READ FIRST where vcol=expr
+handler t1 read FIRST where b >= -2;
+a b c d
+2 -2 -2 b
+# HANDLER tbl_name READ NEXT where non-vcol=expr
+handler t1 read NEXT where d='c';
+a b c d
+1 -1 -1 c
+# HANDLER tbl_name READ NEXT where vcol=expr
+handler t1 read NEXT where b<=-4;
+a b c d
+# HANDLER tbl_name CLOSE
+handler t1 close;
+drop table t1;
=== added file 'mysql-test/suite/vcol/t/vcol_handler_maria.test'
--- a/mysql-test/suite/vcol/t/vcol_handler_maria.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/vcol/t/vcol_handler_maria.test 2010-08-03 09:49:25 +0000
@@ -0,0 +1,50 @@
+################################################################################
+# t/vcol_handler_maria.test #
+# #
+# Purpose: #
+# Testing HANDLER.
+# #
+# Maria branch #
+# #
+#------------------------------------------------------------------------------#
+# Original Author: Andrey Zhakov #
+# Original Date: 2008-09-04 #
+# Change Author: #
+# Change Date: #
+# Change: #
+################################################################################
+
+#
+# NOTE: PLEASE DO NOT ADD NOT MYISAM SPECIFIC TESTCASES HERE !
+# TESTCASES WHICH MUST BE APPLIED TO ALL STORAGE ENGINES MUST BE ADDED IN
+# THE SOURCED FILES ONLY.
+#
+
+#------------------------------------------------------------------------------#
+# General not engine specific settings and requirements
+--source suite/vcol/inc/vcol_init_vars.pre
+
+#------------------------------------------------------------------------------#
+# Cleanup
+--source suite/vcol/inc/vcol_cleanup.inc
+
+#------------------------------------------------------------------------------#
+# Engine specific settings and requirements
+
+##### Storage engine to be tested
+# Set the session storage engine
+eval SET @@session.storage_engine = 'maria';
+
+##### Workarounds for known open engine specific bugs
+# none
+
+#------------------------------------------------------------------------------#
+# Execute the tests to be applied to all storage engines
+--source suite/vcol/inc/vcol_handler.inc
+
+#------------------------------------------------------------------------------#
+# Execute storage engine specific tests
+
+#------------------------------------------------------------------------------#
+# Cleanup
+--source suite/vcol/inc/vcol_cleanup.inc
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-07-25 15:09:21 +0000
+++ b/storage/maria/ha_maria.cc 2010-08-03 09:49:25 +0000
@@ -468,7 +468,7 @@
recinfo_pos= recinfo;
create_info->null_bytes= table_arg->s->null_bytes;
- while (recpos < (uint) share->reclength)
+ while (recpos < (uint) share->stored_rec_length)
{
Field **field, *found= 0;
minpos= share->reclength;
=== modified file 'storage/maria/ha_maria.h'
--- a/storage/maria/ha_maria.h 2010-07-23 20:37:21 +0000
+++ b/storage/maria/ha_maria.h 2010-08-03 09:49:25 +0000
@@ -148,6 +148,7 @@
int assign_to_keycache(THD * thd, HA_CHECK_OPT * check_opt);
int preload_keys(THD * thd, HA_CHECK_OPT * check_opt);
bool check_if_incompatible_data(HA_CREATE_INFO * info, uint table_changes);
+ bool check_if_supported_virtual_columns(void) { return TRUE;}
#ifdef HAVE_REPLICATION
int dump(THD * thd, int fd);
int net_read_dump(NET * net);
Hi all,
I hand-added Antony's fix for 571200 that I had started, and submitted it to
https://code.launchpad.net/~capttofu/maria/bug_571200/+merge/31606 for
review. I'll be adding his other fixes for each bug as well.
Thanks Antony!
Patrick
[Maria-developers] Rev 2809: Fix for launchpad bug#611625: Removing NULL references from subquery parameter list added. in file:///home/bell/maria/bzr/work-maria-5.3-lb611625/
by sanja@askmonty.org 02 Aug '10
At file:///home/bell/maria/bzr/work-maria-5.3-lb611625/
------------------------------------------------------------
revno: 2809
revision-id: sanja(a)askmonty.org-20100802055612-se9olthiaazi5xju
parent: sanja(a)askmonty.org-20100730041658-2naumadh26t93e3g
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-lb611625
timestamp: Mon 2010-08-02 08:56:12 +0300
message:
Fix for launchpad bug#611625: Removing NULL references from subquery parameter list added.
Incorrect limitation on number of parameters removed.
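[Editor's note] The patch below drops the old hard limit on the number of parameters and instead removes references that the optimizer has already nulled out before the cache table is sized. As a rough standalone illustration of that pruning idea only (plain C++, with std::vector<int*> standing in for the server's List<Item*>; the names here are illustrative, the real change is in Expression_cache_tmptable::init in the diff that follows):

#include <cassert>
#include <cstddef>
#include <vector>

// Prune entries the optimizer has cleared, keeping only live dependencies.
// (Illustrative only; the real code walks a List<Item*> with a
// List_iterator and calls li.remove() on NULL references.)
static std::size_t prune_null_refs(std::vector<int*> &params)
{
    std::size_t kept = 0;
    for (auto it = params.begin(); it != params.end(); )
    {
        if (*it == nullptr)
            it = params.erase(it);   // reference optimized away -> drop it
        else
        {
            ++kept;                  // still a live dependency of the cache
            ++it;
        }
    }
    return kept;                     // 0 means nothing is left to key the cache on
}

int main()
{
    int a = 1, b = 2;
    std::vector<int*> params = { &a, nullptr, &b, nullptr };
    std::size_t live = prune_null_refs(params);
    assert(live == 2 && params.size() == 2);
    return 0;
}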
=== modified file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 2010-07-30 04:16:58 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-08-02 05:56:12 +0000
@@ -2985,3 +2985,201 @@
1 NULL f
drop table t1,t2,t3,t4;
set @@optimizer_switch= default;
+#launchpad BUG#611625
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,'w');
+INSERT INTO `t1` VALUES (2,7,'m');
+INSERT INTO `t1` VALUES (3,9,'m');
+INSERT INTO `t1` VALUES (4,7,'k');
+INSERT INTO `t1` VALUES (5,4,'r');
+INSERT INTO `t1` VALUES (6,2,'t');
+INSERT INTO `t1` VALUES (7,6,'j');
+INSERT INTO `t1` VALUES (8,8,'u');
+INSERT INTO `t1` VALUES (9,NULL,'h');
+INSERT INTO `t1` VALUES (10,5,'o');
+INSERT INTO `t1` VALUES (11,NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,'k');
+INSERT INTO `t1` VALUES (13,188,'e');
+INSERT INTO `t1` VALUES (14,2,'n');
+INSERT INTO `t1` VALUES (15,1,'t');
+INSERT INTO `t1` VALUES (16,1,'c');
+INSERT INTO `t1` VALUES (17,0,'m');
+INSERT INTO `t1` VALUES (18,9,'y');
+INSERT INTO `t1` VALUES (19,NULL,'f');
+INSERT INTO `t1` VALUES (20,4,'d');
+CREATE TABLE `t3` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t3` VALUES (1,6,'r');
+INSERT INTO `t3` VALUES (2,8,'c');
+INSERT INTO `t3` VALUES (3,6,'o');
+INSERT INTO `t3` VALUES (4,6,'c');
+INSERT INTO `t3` VALUES (5,3,'d');
+INSERT INTO `t3` VALUES (6,9,'v');
+INSERT INTO `t3` VALUES (7,2,'m');
+INSERT INTO `t3` VALUES (8,1,'j');
+INSERT INTO `t3` VALUES (9,8,'f');
+INSERT INTO `t3` VALUES (10,0,'n');
+INSERT INTO `t3` VALUES (11,9,'z');
+INSERT INTO `t3` VALUES (12,8,'h');
+INSERT INTO `t3` VALUES (13,NULL,'q');
+INSERT INTO `t3` VALUES (14,0,'w');
+INSERT INTO `t3` VALUES (15,5,'z');
+INSERT INTO `t3` VALUES (16,1,'j');
+INSERT INTO `t3` VALUES (17,1,'a');
+INSERT INTO `t3` VALUES (18,6,'m');
+INSERT INTO `t3` VALUES (19,6,'n');
+INSERT INTO `t3` VALUES (20,1,'e');
+INSERT INTO `t3` VALUES (21,8,'u');
+INSERT INTO `t3` VALUES (22,1,'s');
+INSERT INTO `t3` VALUES (23,0,'u');
+INSERT INTO `t3` VALUES (24,4,'r');
+INSERT INTO `t3` VALUES (25,9,'g');
+INSERT INTO `t3` VALUES (26,8,'o');
+INSERT INTO `t3` VALUES (27,5,'w');
+INSERT INTO `t3` VALUES (28,9,'b');
+INSERT INTO `t3` VALUES (29,5,NULL);
+INSERT INTO `t3` VALUES (30,NULL,'y');
+INSERT INTO `t3` VALUES (31,NULL,'y');
+INSERT INTO `t3` VALUES (32,105,'u');
+INSERT INTO `t3` VALUES (33,0,'p');
+INSERT INTO `t3` VALUES (34,3,'s');
+INSERT INTO `t3` VALUES (35,1,'e');
+INSERT INTO `t3` VALUES (36,75,'d');
+INSERT INTO `t3` VALUES (37,9,'d');
+INSERT INTO `t3` VALUES (38,7,'c');
+INSERT INTO `t3` VALUES (39,NULL,'b');
+INSERT INTO `t3` VALUES (40,NULL,'t');
+INSERT INTO `t3` VALUES (41,4,NULL);
+INSERT INTO `t3` VALUES (42,0,'y');
+INSERT INTO `t3` VALUES (43,204,'c');
+INSERT INTO `t3` VALUES (44,0,'d');
+INSERT INTO `t3` VALUES (45,9,'x');
+INSERT INTO `t3` VALUES (46,8,'p');
+INSERT INTO `t3` VALUES (47,7,'e');
+INSERT INTO `t3` VALUES (48,8,'g');
+INSERT INTO `t3` VALUES (49,NULL,'x');
+INSERT INTO `t3` VALUES (50,6,'s');
+INSERT INTO `t3` VALUES (51,5,'e');
+INSERT INTO `t3` VALUES (52,2,'l');
+INSERT INTO `t3` VALUES (53,3,'p');
+INSERT INTO `t3` VALUES (54,7,'h');
+INSERT INTO `t3` VALUES (55,NULL,'m');
+INSERT INTO `t3` VALUES (56,145,'n');
+INSERT INTO `t3` VALUES (57,0,'v');
+INSERT INTO `t3` VALUES (58,1,'b');
+INSERT INTO `t3` VALUES (59,7,'x');
+INSERT INTO `t3` VALUES (60,3,'r');
+INSERT INTO `t3` VALUES (61,NULL,'t');
+INSERT INTO `t3` VALUES (62,2,'w');
+INSERT INTO `t3` VALUES (63,2,'w');
+INSERT INTO `t3` VALUES (64,2,'k');
+INSERT INTO `t3` VALUES (65,8,'a');
+INSERT INTO `t3` VALUES (66,6,'t');
+INSERT INTO `t3` VALUES (67,1,'z');
+INSERT INTO `t3` VALUES (68,NULL,'e');
+INSERT INTO `t3` VALUES (69,1,'q');
+INSERT INTO `t3` VALUES (70,0,'e');
+INSERT INTO `t3` VALUES (71,4,'v');
+INSERT INTO `t3` VALUES (72,1,'d');
+INSERT INTO `t3` VALUES (73,1,'u');
+INSERT INTO `t3` VALUES (74,27,'o');
+INSERT INTO `t3` VALUES (75,4,'b');
+INSERT INTO `t3` VALUES (76,6,'c');
+INSERT INTO `t3` VALUES (77,2,'q');
+INSERT INTO `t3` VALUES (78,248,NULL);
+INSERT INTO `t3` VALUES (79,NULL,'h');
+INSERT INTO `t3` VALUES (80,9,'d');
+INSERT INTO `t3` VALUES (81,75,'w');
+INSERT INTO `t3` VALUES (82,2,'m');
+INSERT INTO `t3` VALUES (83,9,'i');
+INSERT INTO `t3` VALUES (84,4,'w');
+INSERT INTO `t3` VALUES (85,0,'f');
+INSERT INTO `t3` VALUES (86,0,'k');
+INSERT INTO `t3` VALUES (87,1,'v');
+INSERT INTO `t3` VALUES (88,119,'c');
+INSERT INTO `t3` VALUES (89,1,'y');
+INSERT INTO `t3` VALUES (90,7,'h');
+INSERT INTO `t3` VALUES (91,2,NULL);
+INSERT INTO `t3` VALUES (92,7,'t');
+INSERT INTO `t3` VALUES (93,2,'l');
+INSERT INTO `t3` VALUES (94,6,'a');
+INSERT INTO `t3` VALUES (95,4,'r');
+INSERT INTO `t3` VALUES (96,5,'s');
+INSERT INTO `t3` VALUES (97,7,'z');
+INSERT INTO `t3` VALUES (98,1,'j');
+INSERT INTO `t3` VALUES (99,7,'c');
+INSERT INTO `t3` VALUES (100,2,'f');
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`)
+) ENGINE=MyISAM AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,8,NULL);
+set optimizer_switch='subquery_cache=off';
+SELECT (
+SELECT `col_int_nokey`
+FROM t3
+WHERE table1 .`col_varchar_nokey` ) field13
+FROM t2 table1 JOIN t1 table2 ON table2 .`pk`
+ORDER BY field13;
+field13
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+set optimizer_switch='subquery_cache=on';
+SELECT
+(SELECT `col_int_nokey`
+ FROM t3
+WHERE table1 .`col_varchar_nokey` ) field13
+FROM t2 table1 JOIN t1 table2 ON table2 .`pk`
+ORDER BY field13;
+field13
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+NULL
+drop table t1,t2,t3;
+set @@optimizer_switch= default;
=== modified file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 2010-07-30 04:16:58 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-08-02 05:56:12 +0000
@@ -1306,3 +1306,167 @@
drop table t1,t2,t3,t4;
set @@optimizer_switch= default;
+
+#
+--echo #launchpad BUG#611625
+#
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,'w');
+INSERT INTO `t1` VALUES (2,7,'m');
+INSERT INTO `t1` VALUES (3,9,'m');
+INSERT INTO `t1` VALUES (4,7,'k');
+INSERT INTO `t1` VALUES (5,4,'r');
+INSERT INTO `t1` VALUES (6,2,'t');
+INSERT INTO `t1` VALUES (7,6,'j');
+INSERT INTO `t1` VALUES (8,8,'u');
+INSERT INTO `t1` VALUES (9,NULL,'h');
+INSERT INTO `t1` VALUES (10,5,'o');
+INSERT INTO `t1` VALUES (11,NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,'k');
+INSERT INTO `t1` VALUES (13,188,'e');
+INSERT INTO `t1` VALUES (14,2,'n');
+INSERT INTO `t1` VALUES (15,1,'t');
+INSERT INTO `t1` VALUES (16,1,'c');
+INSERT INTO `t1` VALUES (17,0,'m');
+INSERT INTO `t1` VALUES (18,9,'y');
+INSERT INTO `t1` VALUES (19,NULL,'f');
+INSERT INTO `t1` VALUES (20,4,'d');
+CREATE TABLE `t3` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t3` VALUES (1,6,'r');
+INSERT INTO `t3` VALUES (2,8,'c');
+INSERT INTO `t3` VALUES (3,6,'o');
+INSERT INTO `t3` VALUES (4,6,'c');
+INSERT INTO `t3` VALUES (5,3,'d');
+INSERT INTO `t3` VALUES (6,9,'v');
+INSERT INTO `t3` VALUES (7,2,'m');
+INSERT INTO `t3` VALUES (8,1,'j');
+INSERT INTO `t3` VALUES (9,8,'f');
+INSERT INTO `t3` VALUES (10,0,'n');
+INSERT INTO `t3` VALUES (11,9,'z');
+INSERT INTO `t3` VALUES (12,8,'h');
+INSERT INTO `t3` VALUES (13,NULL,'q');
+INSERT INTO `t3` VALUES (14,0,'w');
+INSERT INTO `t3` VALUES (15,5,'z');
+INSERT INTO `t3` VALUES (16,1,'j');
+INSERT INTO `t3` VALUES (17,1,'a');
+INSERT INTO `t3` VALUES (18,6,'m');
+INSERT INTO `t3` VALUES (19,6,'n');
+INSERT INTO `t3` VALUES (20,1,'e');
+INSERT INTO `t3` VALUES (21,8,'u');
+INSERT INTO `t3` VALUES (22,1,'s');
+INSERT INTO `t3` VALUES (23,0,'u');
+INSERT INTO `t3` VALUES (24,4,'r');
+INSERT INTO `t3` VALUES (25,9,'g');
+INSERT INTO `t3` VALUES (26,8,'o');
+INSERT INTO `t3` VALUES (27,5,'w');
+INSERT INTO `t3` VALUES (28,9,'b');
+INSERT INTO `t3` VALUES (29,5,NULL);
+INSERT INTO `t3` VALUES (30,NULL,'y');
+INSERT INTO `t3` VALUES (31,NULL,'y');
+INSERT INTO `t3` VALUES (32,105,'u');
+INSERT INTO `t3` VALUES (33,0,'p');
+INSERT INTO `t3` VALUES (34,3,'s');
+INSERT INTO `t3` VALUES (35,1,'e');
+INSERT INTO `t3` VALUES (36,75,'d');
+INSERT INTO `t3` VALUES (37,9,'d');
+INSERT INTO `t3` VALUES (38,7,'c');
+INSERT INTO `t3` VALUES (39,NULL,'b');
+INSERT INTO `t3` VALUES (40,NULL,'t');
+INSERT INTO `t3` VALUES (41,4,NULL);
+INSERT INTO `t3` VALUES (42,0,'y');
+INSERT INTO `t3` VALUES (43,204,'c');
+INSERT INTO `t3` VALUES (44,0,'d');
+INSERT INTO `t3` VALUES (45,9,'x');
+INSERT INTO `t3` VALUES (46,8,'p');
+INSERT INTO `t3` VALUES (47,7,'e');
+INSERT INTO `t3` VALUES (48,8,'g');
+INSERT INTO `t3` VALUES (49,NULL,'x');
+INSERT INTO `t3` VALUES (50,6,'s');
+INSERT INTO `t3` VALUES (51,5,'e');
+INSERT INTO `t3` VALUES (52,2,'l');
+INSERT INTO `t3` VALUES (53,3,'p');
+INSERT INTO `t3` VALUES (54,7,'h');
+INSERT INTO `t3` VALUES (55,NULL,'m');
+INSERT INTO `t3` VALUES (56,145,'n');
+INSERT INTO `t3` VALUES (57,0,'v');
+INSERT INTO `t3` VALUES (58,1,'b');
+INSERT INTO `t3` VALUES (59,7,'x');
+INSERT INTO `t3` VALUES (60,3,'r');
+INSERT INTO `t3` VALUES (61,NULL,'t');
+INSERT INTO `t3` VALUES (62,2,'w');
+INSERT INTO `t3` VALUES (63,2,'w');
+INSERT INTO `t3` VALUES (64,2,'k');
+INSERT INTO `t3` VALUES (65,8,'a');
+INSERT INTO `t3` VALUES (66,6,'t');
+INSERT INTO `t3` VALUES (67,1,'z');
+INSERT INTO `t3` VALUES (68,NULL,'e');
+INSERT INTO `t3` VALUES (69,1,'q');
+INSERT INTO `t3` VALUES (70,0,'e');
+INSERT INTO `t3` VALUES (71,4,'v');
+INSERT INTO `t3` VALUES (72,1,'d');
+INSERT INTO `t3` VALUES (73,1,'u');
+INSERT INTO `t3` VALUES (74,27,'o');
+INSERT INTO `t3` VALUES (75,4,'b');
+INSERT INTO `t3` VALUES (76,6,'c');
+INSERT INTO `t3` VALUES (77,2,'q');
+INSERT INTO `t3` VALUES (78,248,NULL);
+INSERT INTO `t3` VALUES (79,NULL,'h');
+INSERT INTO `t3` VALUES (80,9,'d');
+INSERT INTO `t3` VALUES (81,75,'w');
+INSERT INTO `t3` VALUES (82,2,'m');
+INSERT INTO `t3` VALUES (83,9,'i');
+INSERT INTO `t3` VALUES (84,4,'w');
+INSERT INTO `t3` VALUES (85,0,'f');
+INSERT INTO `t3` VALUES (86,0,'k');
+INSERT INTO `t3` VALUES (87,1,'v');
+INSERT INTO `t3` VALUES (88,119,'c');
+INSERT INTO `t3` VALUES (89,1,'y');
+INSERT INTO `t3` VALUES (90,7,'h');
+INSERT INTO `t3` VALUES (91,2,NULL);
+INSERT INTO `t3` VALUES (92,7,'t');
+INSERT INTO `t3` VALUES (93,2,'l');
+INSERT INTO `t3` VALUES (94,6,'a');
+INSERT INTO `t3` VALUES (95,4,'r');
+INSERT INTO `t3` VALUES (96,5,'s');
+INSERT INTO `t3` VALUES (97,7,'z');
+INSERT INTO `t3` VALUES (98,1,'j');
+INSERT INTO `t3` VALUES (99,7,'c');
+INSERT INTO `t3` VALUES (100,2,'f');
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`)
+) ENGINE=MyISAM AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,8,NULL);
+
+set optimizer_switch='subquery_cache=off';
+
+SELECT (
+SELECT `col_int_nokey`
+FROM t3
+WHERE table1 .`col_varchar_nokey` ) field13
+FROM t2 table1 JOIN t1 table2 ON table2 .`pk`
+ORDER BY field13;
+
+set optimizer_switch='subquery_cache=on';
+
+SELECT
+ (SELECT `col_int_nokey`
+ FROM t3
+ WHERE table1 .`col_varchar_nokey` ) field13
+FROM t2 table1 JOIN t1 table2 ON table2 .`pk`
+ORDER BY field13;
+
+drop table t1,t2,t3;
+set @@optimizer_switch= default;
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-07-16 11:02:15 +0000
+++ b/sql/sql_class.h 2010-08-02 05:56:12 +0000
@@ -62,9 +62,9 @@
class Item_iterator_ref_list: public Item_iterator
{
- List_iterator_fast<Item*> list;
+ List_iterator<Item*> list;
public:
- Item_iterator_ref_list(List_iterator_fast<Item*> &arg_list):
+ Item_iterator_ref_list(List_iterator<Item*> &arg_list):
list(arg_list) {}
void open() { list.rewind(); }
Item *next() { return *(list++); }
=== modified file 'sql/sql_expression_cache.cc'
--- a/sql/sql_expression_cache.cc 2010-07-30 04:16:58 +0000
+++ b/sql/sql_expression_cache.cc 2010-08-02 05:56:12 +0000
@@ -96,22 +96,39 @@
void Expression_cache_tmptable::init()
{
- List_iterator_fast<Item*> li(*list);
+ List_iterator<Item*> li(*list);
Item_iterator_ref_list it(li);
Item **item;
uint field_counter;
DBUG_ENTER("Expression_cache_tmptable::init");
DBUG_ASSERT(!inited);
inited= TRUE;
-
- if (!(ULONGLONG_MAX >> (list->elements + 1)))
- {
- DBUG_PRINT("info", ("Too many dependencies"));
+ cache_table= NULL;
+
+ while ((item= li++))
+ {
+ DBUG_ASSERT(item);
+ if (*item)
+ {
+ DBUG_ASSERT((*item)->fixed);
+ items.push_back((*item));
+ }
+ else
+ {
+ /*
+ This is possible when optimizer already executed this subquery and
+ optimized out a condition predicate. See launchpad bug#611625
+ */
+ li.remove();
+ }
+ }
+
+ if (list->elements == 0)
+ {
+ DBUG_PRINT("info", ("All parameters was removed by optimizer."));
DBUG_VOID_RETURN;
}
- cache_table= NULL;
-
cache_table_param.init();
/* dependent items and result */
cache_table_param.field_count= list->elements + 1;
@@ -119,13 +136,6 @@
cache_table_param.skip_create_table= 1;
cache_table= NULL;
- while ((item= li++))
- {
- DBUG_ASSERT(item);
- DBUG_ASSERT(*item);
- DBUG_ASSERT((*item)->fixed);
- items.push_back((*item));
- }
items.push_front(val);
if (!(cache_table= create_tmp_table(table_thd, &cache_table_param,
[Maria-developers] Rev 2808: Fix for launchpad bug#609043 in file:///home/bell/maria/bzr/work-maria-5.3-lb609043/
by sanja@askmonty.org 30 Jul '10
At file:///home/bell/maria/bzr/work-maria-5.3-lb609043/
------------------------------------------------------------
revno: 2808
revision-id: sanja(a)askmonty.org-20100730041658-2naumadh26t93e3g
parent: sanja(a)askmonty.org-20100729111348-jjp89wlvs3kg0fqq
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-lb609043
timestamp: Fri 2010-07-30 07:16:58 +0300
message:
Fix for launchpad bug#609043
Removed indirect reference in equalities for cache index lookup.
We should use a direct reference because some optimization of the
query may optimize out a condition predicate and if the outer reference
is the only element of the condition predicate the indirect reference
becomes NULL.
We can resolve the indirect reference correctly in
Expression_cache_tmptable::make_equalities because it is called before
optimization of the cached subquery.
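[Editor's note] A toy standalone sketch of the failure mode described above (plain C++; Item here is a stub struct, and the extra Item** level stands in for the Item_ref indirection the old code kept — this is not the server's class hierarchy). The actual fix, building the equality from *ref directly, is in the sql_expression_cache.cc hunk below.

#include <cassert>

// Stub; the real Item is the server's expression node class.
struct Item { int value; };

int main()
{
    Item outer_ref{42};
    Item *slot = &outer_ref;      // the parameter-list entry
    Item **indirect = &slot;      // what the old Item_ref-based equality held
    Item *direct = slot;          // what the fixed code captures up front

    slot = nullptr;               // optimizer "optimizes out" the predicate

    assert(*indirect == nullptr); // the indirect path now resolves to NULL
    assert(direct->value == 42);  // the direct reference still resolves
    return 0;
}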
=== modified file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 2010-07-29 11:13:48 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-07-30 04:16:58 +0000
@@ -2881,3 +2881,107 @@
field1 field2 field3 field4 field5 field6 field7 field8 field9 field10
drop table t1,t2,t3,t4,t5;
set @@optimizer_switch= default;
+#launchpad BUG#609043
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (19,NULL,6,'2004-08-20','2004-08-20','05:03:03','05:03:03','2007-04-19 00:19:53','2007-04-19 00:19:53','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'1900-01-01','1900-01-01','18:38:59','18:38:59','1900-01-01 00:00:00','1900-01-01 00:00:00','d','d');
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+CREATE TABLE `t3` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
+CREATE TABLE `t4` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t4` VALUES (100,2,5,'2001-07-26','2001-07-26','11:49:25','11:49:25','2007-04-25 05:08:49','2007-04-25 05:08:49','f','f');
+SET @@optimizer_switch = 'subquery_cache=off';
+/* cache is off */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+COUNT( DISTINCT table2 .`col_int_key` ) (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) field10
+1 NULL d
+1 NULL f
+SET @@optimizer_switch = 'subquery_cache=on';
+/* cache is on */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+COUNT( DISTINCT table2 .`col_int_key` ) (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) field10
+1 NULL d
+1 NULL f
+drop table t1,t2,t3,t4;
+set @@optimizer_switch= default;
=== modified file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 2010-07-29 11:13:48 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-07-30 04:16:58 +0000
@@ -1202,3 +1202,107 @@
drop table t1,t2,t3,t4,t5;
set @@optimizer_switch= default;
+
+
+#
+--echo #launchpad BUG#609043
+#
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (19,NULL,6,'2004-08-20','2004-08-20','05:03:03','05:03:03','2007-04-19 00:19:53','2007-04-19 00:19:53','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'1900-01-01','1900-01-01','18:38:59','18:38:59','1900-01-01 00:00:00','1900-01-01 00:00:00','d','d');
+
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+
+CREATE TABLE `t3` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
+
+CREATE TABLE `t4` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t4` VALUES (100,2,5,'2001-07-26','2001-07-26','11:49:25','11:49:25','2007-04-25 05:08:49','2007-04-25 05:08:49','f','f');
+
+SET @@optimizer_switch = 'subquery_cache=off';
+
+/* cache is off */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+
+SET @@optimizer_switch = 'subquery_cache=on';
+
+/* cache is on */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+
+drop table t1,t2,t3,t4;
+set @@optimizer_switch= default;
=== modified file 'sql/sql_expression_cache.cc'
--- a/sql/sql_expression_cache.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_expression_cache.cc 2010-07-30 04:16:58 +0000
@@ -41,7 +41,6 @@
List<Item> args;
List_iterator_fast<Item*> li(*list);
Item **ref;
- Name_resolution_context *cn= NULL;
DBUG_ENTER("Expression_cache_tmptable::make_equalities");
for (uint i= 1 /* skip result filed */; (ref= li++); i++)
@@ -58,14 +57,7 @@
fld->type() == MYSQL_TYPE_NEWDECIMAL ||
fld->type() == MYSQL_TYPE_DECIMAL)
{
- if (!cn)
- {
- // dummy resolution context
- cn= new Name_resolution_context();
- cn->init();
- }
- args.push_front(new Item_func_eq(new Item_ref(cn, ref, "", "", FALSE),
- new Item_field(fld)));
+ args.push_front(new Item_func_eq(*ref, new Item_field(fld)));
}
}
if (args.elements == 1)
[Maria-developers] Rev 2807: Fix for launchpad bug#609043 in file:///home/bell/maria/bzr/work-maria-5.3-lb609043/
by sanja@askmonty.org 29 Jul '10
At file:///home/bell/maria/bzr/work-maria-5.3-lb609043/
------------------------------------------------------------
revno: 2807
revision-id: sanja(a)askmonty.org-20100729164449-r66iqeuva2z0d8o8
parent: timour(a)askmonty.org-20100723082500-kwqzzvuv62nw412k
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-lb609043
timestamp: Thu 2010-07-29 19:44:49 +0300
message:
Fix for launchpad bug#609043
Removed indirect reference in equalities for cache index lookup.
We should use a direct reference because optimization of the query can optimize out a condition, and if the outer reference is the only element of the condition the indirect reference becomes NULL.
We can resolve the indirect reference correctly in Expression_cache_tmptable::make_equalities because it is called before optimization of the cached subquery.
=== modified file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-07-29 16:44:49 +0000
@@ -1838,3 +1838,107 @@
Handler_read_rnd_next 27
drop table t0,t1,t2;
set optimizer_switch='default';
+# launchpad BUG#609043
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (19,NULL,6,'2004-08-20','2004-08-20','05:03:03','05:03:03','2007-04-19 00:19:53','2007-04-19 00:19:53','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'1900-01-01','1900-01-01','18:38:59','18:38:59','1900-01-01 00:00:00','1900-01-01 00:00:00','d','d');
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+CREATE TABLE `t3` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
+CREATE TABLE `t4` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t4` VALUES (100,2,5,'2001-07-26','2001-07-26','11:49:25','11:49:25','2007-04-25 05:08:49','2007-04-25 05:08:49','f','f');
+SET @@optimizer_switch = 'subquery_cache=off';
+/* cache is off */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+COUNT( DISTINCT table2 .`col_int_key` ) (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) field10
+1 NULL d
+1 NULL f
+SET @@optimizer_switch = 'subquery_cache=on';
+/* cache is on */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+COUNT( DISTINCT table2 .`col_int_key` ) (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) field10
+1 NULL d
+1 NULL f
+drop table t1,t2,t3,t4;
+set @@optimizer_switch= default;
=== modified file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 2010-07-10 10:37:30 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-07-29 16:44:49 +0000
@@ -507,3 +507,107 @@
drop table t0,t1,t2;
set optimizer_switch='default';
+
+#
+--echo # launchpad BUG#609043
+#
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (19,NULL,6,'2004-08-20','2004-08-20','05:03:03','05:03:03','2007-04-19 00:19:53','2007-04-19 00:19:53','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'1900-01-01','1900-01-01','18:38:59','18:38:59','1900-01-01 00:00:00','1900-01-01 00:00:00','d','d');
+
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+
+CREATE TABLE `t3` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
+
+CREATE TABLE `t4` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t4` VALUES (100,2,5,'2001-07-26','2001-07-26','11:49:25','11:49:25','2007-04-25 05:08:49','2007-04-25 05:08:49','f','f');
+
+SET @@optimizer_switch = 'subquery_cache=off';
+
+/* cache is off */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+
+SET @@optimizer_switch = 'subquery_cache=on';
+
+/* cache is on */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+
+drop table t1,t2,t3,t4;
+set @@optimizer_switch= default;
+
=== modified file 'sql/sql_expression_cache.cc'
--- a/sql/sql_expression_cache.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_expression_cache.cc 2010-07-29 16:44:49 +0000
@@ -41,7 +41,6 @@
List<Item> args;
List_iterator_fast<Item*> li(*list);
Item **ref;
- Name_resolution_context *cn= NULL;
DBUG_ENTER("Expression_cache_tmptable::make_equalities");
for (uint i= 1 /* skip result filed */; (ref= li++); i++)
@@ -58,13 +57,7 @@
fld->type() == MYSQL_TYPE_NEWDECIMAL ||
fld->type() == MYSQL_TYPE_DECIMAL)
{
- if (!cn)
- {
- // dummy resolution context
- cn= new Name_resolution_context();
- cn->init();
- }
- args.push_front(new Item_func_eq(new Item_ref(cn, ref, "", "", FALSE),
+ args.push_front(new Item_func_eq(*ref,
new Item_field(fld)));
}
}
[Maria-developers] Rev 2807: Bugfix for launchpad bug#608834 (608824, 609045, 609052). in file:///home/bell/maria/bzr/work-maria-5.3-lb608834/
by sanja@askmonty.org 29 Jul '10
At file:///home/bell/maria/bzr/work-maria-5.3-lb608834/
------------------------------------------------------------
revno: 2807
revision-id: sanja(a)askmonty.org-20100729111348-jjp89wlvs3kg0fqq
parent: timour(a)askmonty.org-20100723082500-kwqzzvuv62nw412k
committer: sanja(a)askmonty.org
branch nick: work-maria-5.3-lb608834
timestamp: Thu 2010-07-29 14:13:48 +0300
message:
Bugfix for launchpad bug#608834 (608824, 609045, 609052).
Added get_tmp_table_item() to the cache wrapper, since it can hold any non-simple Item (Item_func, Item_field, Item_subquery).
=== modified file 'mysql-test/r/subquery_cache.result'
--- a/mysql-test/r/subquery_cache.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subquery_cache.result 2010-07-29 11:13:48 +0000
@@ -1838,3 +1838,1046 @@
Handler_read_rnd_next 27
drop table t0,t1,t2;
set optimizer_switch='default';
+#launchpad BUG#608834
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,7,8,'01:27:35','v','v');
+INSERT INTO `t2` VALUES (11,1,9,'19:48:31','r','r');
+INSERT INTO `t2` VALUES (12,5,9,'00:00:00','a','a');
+INSERT INTO `t2` VALUES (13,3,186,'19:53:05','m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,'19:18:56','y','y');
+INSERT INTO `t2` VALUES (15,92,2,'10:55:12','j','j');
+INSERT INTO `t2` VALUES (16,7,3,'00:25:00','d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'12:35:47','z','z');
+INSERT INTO `t2` VALUES (18,3,133,'19:53:03','e','e');
+INSERT INTO `t2` VALUES (19,5,1,'17:53:30','h','h');
+INSERT INTO `t2` VALUES (20,1,8,'11:35:49','b','b');
+INSERT INTO `t2` VALUES (21,2,5,NULL,'s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'06:01:40','e','e');
+INSERT INTO `t2` VALUES (23,1,8,'05:45:11','j','j');
+INSERT INTO `t2` VALUES (24,0,6,'00:00:00','e','e');
+INSERT INTO `t2` VALUES (25,210,51,'00:00:00','f','f');
+INSERT INTO `t2` VALUES (26,8,4,'06:11:01','v','v');
+INSERT INTO `t2` VALUES (27,7,7,'13:02:46','x','x');
+INSERT INTO `t2` VALUES (28,5,6,'21:44:25','m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'22:43:58','c','c');
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,'11:28:45','w','w');
+INSERT INTO `t1` VALUES (2,7,9,'20:25:14','m','m');
+INSERT INTO `t1` VALUES (3,9,3,'13:47:24','m','m');
+INSERT INTO `t1` VALUES (4,7,9,'19:24:11','k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'15:59:13','r','r');
+INSERT INTO `t1` VALUES (6,2,9,'00:00:00','t','t');
+INSERT INTO `t1` VALUES (7,6,3,'15:15:04','j','j');
+INSERT INTO `t1` VALUES (8,8,8,'11:32:06','u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'18:32:33','h','h');
+INSERT INTO `t1` VALUES (10,5,53,'15:19:25','o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,'19:03:19',NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'00:39:46','k','k');
+INSERT INTO `t1` VALUES (13,188,166,NULL,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,'00:00:00','n','n');
+INSERT INTO `t1` VALUES (15,1,0,'13:12:11','t','t');
+INSERT INTO `t1` VALUES (16,1,1,'04:56:48','c','c');
+INSERT INTO `t1` VALUES (17,0,9,'19:56:05','m','m');
+INSERT INTO `t1` VALUES (18,9,5,'19:35:19','y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'05:03:03','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'18:38:59','d','d');
+set @@optimizer_switch='subquery_cache=off';
+/* cache is off */ SELECT (
+SELECT 4
+FROM DUAL ) AS field1 , SUM( DISTINCT table1 . `pk` ) AS field2 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_nokey` ) AS SUBQUERY2_field1
+FROM ( t1 AS SUBQUERY2_t1 INNER JOIN t1 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `pk` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` <= table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_int_nokey` < table1 . `pk` ) AS field3 , table1 . `col_time_key` AS field4 , table1 . `col_int_key` AS field5 , CONCAT ( table2 . `col_varchar_nokey` , table1 . `col_varchar_key` ) AS field6
+FROM ( t1 AS table1 INNER JOIN ( ( t1 AS table2 LEFT JOIN t2 AS table3 ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_nokey` ) )
+WHERE ( table2 . `col_varchar_nokey` NOT IN (
+SELECT 'd' UNION
+SELECT 'u' ) ) OR table3 . `col_varchar_nokey` <= table1 . `col_varchar_key`
+GROUP BY field1, field3, field4, field5, field6
+ORDER BY table1 . `col_int_key` , field1, field2, field3, field4, field5, field6
+;
+field1 field2 field3 field4 field5 field6
+4 5 9 15:59:13 NULL cr
+4 5 9 15:59:13 NULL dr
+4 5 9 15:59:13 NULL er
+4 5 9 15:59:13 NULL fr
+4 5 9 15:59:13 NULL hr
+4 5 9 15:59:13 NULL jr
+4 5 9 15:59:13 NULL mr
+4 5 9 15:59:13 NULL rr
+4 5 9 15:59:13 NULL yr
+4 11 9 19:03:19 0 NULL
+4 15 9 13:12:11 0 ct
+4 15 9 13:12:11 0 dt
+4 15 9 13:12:11 0 et
+4 15 9 13:12:11 0 ft
+4 15 9 13:12:11 0 ht
+4 15 9 13:12:11 0 jt
+4 15 9 13:12:11 0 mt
+4 15 9 13:12:11 0 rt
+4 15 9 13:12:11 0 yt
+4 16 9 04:56:48 1 cc
+4 16 9 04:56:48 1 ec
+4 16 9 04:56:48 1 fc
+4 16 9 04:56:48 1 hc
+4 16 9 04:56:48 1 jc
+4 16 9 04:56:48 1 mc
+4 16 9 04:56:48 1 rc
+4 16 9 04:56:48 1 yc
+4 1 9 11:28:45 2 cw
+4 1 9 11:28:45 2 dw
+4 1 9 11:28:45 2 ew
+4 1 9 11:28:45 2 fw
+4 1 9 11:28:45 2 hw
+4 1 9 11:28:45 2 jw
+4 1 9 11:28:45 2 mw
+4 1 9 11:28:45 2 rw
+4 1 9 11:28:45 2 yw
+4 20 9 18:38:59 2 cd
+4 20 9 18:38:59 2 dd
+4 20 9 18:38:59 2 ed
+4 20 9 18:38:59 2 fd
+4 20 9 18:38:59 2 hd
+4 20 9 18:38:59 2 jd
+4 20 9 18:38:59 2 md
+4 20 9 18:38:59 2 rd
+4 20 9 18:38:59 2 yd
+4 3 9 13:47:24 3 cm
+4 3 9 13:47:24 3 dm
+4 3 9 13:47:24 3 em
+4 3 9 13:47:24 3 fm
+4 3 9 13:47:24 3 hm
+4 3 9 13:47:24 3 jm
+4 3 9 13:47:24 3 mm
+4 3 9 13:47:24 3 rm
+4 3 9 13:47:24 3 ym
+4 7 9 15:15:04 3 cj
+4 7 9 15:15:04 3 dj
+4 7 9 15:15:04 3 ej
+4 7 9 15:15:04 3 fj
+4 7 9 15:15:04 3 hj
+4 7 9 15:15:04 3 jj
+4 7 9 15:15:04 3 mj
+4 7 9 15:15:04 3 rj
+4 7 9 15:15:04 3 yj
+4 14 9 00:00:00 3 cn
+4 14 9 00:00:00 3 dn
+4 14 9 00:00:00 3 en
+4 14 9 00:00:00 3 fn
+4 14 9 00:00:00 3 hn
+4 14 9 00:00:00 3 jn
+4 14 9 00:00:00 3 mn
+4 14 9 00:00:00 3 rn
+4 14 9 00:00:00 3 yn
+4 12 9 00:39:46 5 ck
+4 12 9 00:39:46 5 dk
+4 12 9 00:39:46 5 ek
+4 12 9 00:39:46 5 fk
+4 12 9 00:39:46 5 hk
+4 12 9 00:39:46 5 jk
+4 12 9 00:39:46 5 mk
+4 12 9 00:39:46 5 rk
+4 12 9 00:39:46 5 yk
+4 18 9 19:35:19 5 cy
+4 18 9 19:35:19 5 dy
+4 18 9 19:35:19 5 ey
+4 18 9 19:35:19 5 fy
+4 18 9 19:35:19 5 hy
+4 18 9 19:35:19 5 jy
+4 18 9 19:35:19 5 my
+4 18 9 19:35:19 5 ry
+4 18 9 19:35:19 5 yy
+4 19 9 05:03:03 6 cf
+4 19 9 05:03:03 6 df
+4 19 9 05:03:03 6 ef
+4 19 9 05:03:03 6 ff
+4 19 9 05:03:03 6 hf
+4 19 9 05:03:03 6 jf
+4 19 9 05:03:03 6 mf
+4 19 9 05:03:03 6 rf
+4 19 9 05:03:03 6 yf
+4 8 9 11:32:06 8 cu
+4 8 9 11:32:06 8 du
+4 8 9 11:32:06 8 eu
+4 8 9 11:32:06 8 fu
+4 8 9 11:32:06 8 hu
+4 8 9 11:32:06 8 ju
+4 8 9 11:32:06 8 mu
+4 8 9 11:32:06 8 ru
+4 8 9 11:32:06 8 yu
+4 9 8 18:32:33 8 ch
+4 9 8 18:32:33 8 dh
+4 9 8 18:32:33 8 eh
+4 9 8 18:32:33 8 fh
+4 9 8 18:32:33 8 hh
+4 9 8 18:32:33 8 jh
+4 9 8 18:32:33 8 mh
+4 9 8 18:32:33 8 rh
+4 9 8 18:32:33 8 yh
+4 2 9 20:25:14 9 cm
+4 2 9 20:25:14 9 dm
+4 2 9 20:25:14 9 em
+4 2 9 20:25:14 9 fm
+4 2 9 20:25:14 9 hm
+4 2 9 20:25:14 9 jm
+4 2 9 20:25:14 9 mm
+4 2 9 20:25:14 9 rm
+4 2 9 20:25:14 9 ym
+4 4 9 19:24:11 9 ck
+4 4 9 19:24:11 9 dk
+4 4 9 19:24:11 9 ek
+4 4 9 19:24:11 9 fk
+4 4 9 19:24:11 9 hk
+4 4 9 19:24:11 9 jk
+4 4 9 19:24:11 9 mk
+4 4 9 19:24:11 9 rk
+4 4 9 19:24:11 9 yk
+4 6 9 00:00:00 9 ct
+4 6 9 00:00:00 9 dt
+4 6 9 00:00:00 9 et
+4 6 9 00:00:00 9 ft
+4 6 9 00:00:00 9 ht
+4 6 9 00:00:00 9 jt
+4 6 9 00:00:00 9 mt
+4 6 9 00:00:00 9 rt
+4 6 9 00:00:00 9 yt
+4 17 9 19:56:05 9 cm
+4 17 9 19:56:05 9 dm
+4 17 9 19:56:05 9 em
+4 17 9 19:56:05 9 fm
+4 17 9 19:56:05 9 hm
+4 17 9 19:56:05 9 jm
+4 17 9 19:56:05 9 mm
+4 17 9 19:56:05 9 rm
+4 17 9 19:56:05 9 ym
+4 10 9 15:19:25 53 co
+4 10 9 15:19:25 53 do
+4 10 9 15:19:25 53 eo
+4 10 9 15:19:25 53 fo
+4 10 9 15:19:25 53 ho
+4 10 9 15:19:25 53 jo
+4 10 9 15:19:25 53 mo
+4 10 9 15:19:25 53 ro
+4 10 9 15:19:25 53 yo
+4 13 9 NULL 166 ce
+4 13 9 NULL 166 de
+4 13 9 NULL 166 ee
+4 13 9 NULL 166 fe
+4 13 9 NULL 166 he
+4 13 9 NULL 166 je
+4 13 9 NULL 166 me
+4 13 9 NULL 166 re
+4 13 9 NULL 166 ye
+set @@optimizer_switch='subquery_cache=on';
+/* cache is on */ SELECT (
+SELECT 4
+FROM DUAL ) AS field1 , SUM( DISTINCT table1 . `pk` ) AS field2 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_nokey` ) AS SUBQUERY2_field1
+FROM ( t1 AS SUBQUERY2_t1 INNER JOIN t1 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `pk` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` <= table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_int_nokey` < table1 . `pk` ) AS field3 , table1 . `col_time_key` AS field4 , table1 . `col_int_key` AS field5 , CONCAT ( table2 . `col_varchar_nokey` , table1 . `col_varchar_key` ) AS field6
+FROM ( t1 AS table1 INNER JOIN ( ( t1 AS table2 LEFT JOIN t2 AS table3 ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_nokey` ) )
+WHERE ( table2 . `col_varchar_nokey` NOT IN (
+SELECT 'd' UNION
+SELECT 'u' ) ) OR table3 . `col_varchar_nokey` <= table1 . `col_varchar_key`
+GROUP BY field1, field3, field4, field5, field6
+ORDER BY table1 . `col_int_key` , field1, field2, field3, field4, field5, field6
+;
+field1 field2 field3 field4 field5 field6
+4 5 9 15:59:13 NULL cr
+4 5 9 15:59:13 NULL dr
+4 5 9 15:59:13 NULL er
+4 5 9 15:59:13 NULL fr
+4 5 9 15:59:13 NULL hr
+4 5 9 15:59:13 NULL jr
+4 5 9 15:59:13 NULL mr
+4 5 9 15:59:13 NULL rr
+4 5 9 15:59:13 NULL yr
+4 11 9 19:03:19 0 NULL
+4 15 9 13:12:11 0 ct
+4 15 9 13:12:11 0 dt
+4 15 9 13:12:11 0 et
+4 15 9 13:12:11 0 ft
+4 15 9 13:12:11 0 ht
+4 15 9 13:12:11 0 jt
+4 15 9 13:12:11 0 mt
+4 15 9 13:12:11 0 rt
+4 15 9 13:12:11 0 yt
+4 16 9 04:56:48 1 cc
+4 16 9 04:56:48 1 ec
+4 16 9 04:56:48 1 fc
+4 16 9 04:56:48 1 hc
+4 16 9 04:56:48 1 jc
+4 16 9 04:56:48 1 mc
+4 16 9 04:56:48 1 rc
+4 16 9 04:56:48 1 yc
+4 1 9 11:28:45 2 cw
+4 1 9 11:28:45 2 dw
+4 1 9 11:28:45 2 ew
+4 1 9 11:28:45 2 fw
+4 1 9 11:28:45 2 hw
+4 1 9 11:28:45 2 jw
+4 1 9 11:28:45 2 mw
+4 1 9 11:28:45 2 rw
+4 1 9 11:28:45 2 yw
+4 20 9 18:38:59 2 cd
+4 20 9 18:38:59 2 dd
+4 20 9 18:38:59 2 ed
+4 20 9 18:38:59 2 fd
+4 20 9 18:38:59 2 hd
+4 20 9 18:38:59 2 jd
+4 20 9 18:38:59 2 md
+4 20 9 18:38:59 2 rd
+4 20 9 18:38:59 2 yd
+4 3 9 13:47:24 3 cm
+4 3 9 13:47:24 3 dm
+4 3 9 13:47:24 3 em
+4 3 9 13:47:24 3 fm
+4 3 9 13:47:24 3 hm
+4 3 9 13:47:24 3 jm
+4 3 9 13:47:24 3 mm
+4 3 9 13:47:24 3 rm
+4 3 9 13:47:24 3 ym
+4 7 9 15:15:04 3 cj
+4 7 9 15:15:04 3 dj
+4 7 9 15:15:04 3 ej
+4 7 9 15:15:04 3 fj
+4 7 9 15:15:04 3 hj
+4 7 9 15:15:04 3 jj
+4 7 9 15:15:04 3 mj
+4 7 9 15:15:04 3 rj
+4 7 9 15:15:04 3 yj
+4 14 9 00:00:00 3 cn
+4 14 9 00:00:00 3 dn
+4 14 9 00:00:00 3 en
+4 14 9 00:00:00 3 fn
+4 14 9 00:00:00 3 hn
+4 14 9 00:00:00 3 jn
+4 14 9 00:00:00 3 mn
+4 14 9 00:00:00 3 rn
+4 14 9 00:00:00 3 yn
+4 12 9 00:39:46 5 ck
+4 12 9 00:39:46 5 dk
+4 12 9 00:39:46 5 ek
+4 12 9 00:39:46 5 fk
+4 12 9 00:39:46 5 hk
+4 12 9 00:39:46 5 jk
+4 12 9 00:39:46 5 mk
+4 12 9 00:39:46 5 rk
+4 12 9 00:39:46 5 yk
+4 18 9 19:35:19 5 cy
+4 18 9 19:35:19 5 dy
+4 18 9 19:35:19 5 ey
+4 18 9 19:35:19 5 fy
+4 18 9 19:35:19 5 hy
+4 18 9 19:35:19 5 jy
+4 18 9 19:35:19 5 my
+4 18 9 19:35:19 5 ry
+4 18 9 19:35:19 5 yy
+4 19 9 05:03:03 6 cf
+4 19 9 05:03:03 6 df
+4 19 9 05:03:03 6 ef
+4 19 9 05:03:03 6 ff
+4 19 9 05:03:03 6 hf
+4 19 9 05:03:03 6 jf
+4 19 9 05:03:03 6 mf
+4 19 9 05:03:03 6 rf
+4 19 9 05:03:03 6 yf
+4 8 9 11:32:06 8 cu
+4 8 9 11:32:06 8 du
+4 8 9 11:32:06 8 eu
+4 8 9 11:32:06 8 fu
+4 8 9 11:32:06 8 hu
+4 8 9 11:32:06 8 ju
+4 8 9 11:32:06 8 mu
+4 8 9 11:32:06 8 ru
+4 8 9 11:32:06 8 yu
+4 9 8 18:32:33 8 ch
+4 9 8 18:32:33 8 dh
+4 9 8 18:32:33 8 eh
+4 9 8 18:32:33 8 fh
+4 9 8 18:32:33 8 hh
+4 9 8 18:32:33 8 jh
+4 9 8 18:32:33 8 mh
+4 9 8 18:32:33 8 rh
+4 9 8 18:32:33 8 yh
+4 2 9 20:25:14 9 cm
+4 2 9 20:25:14 9 dm
+4 2 9 20:25:14 9 em
+4 2 9 20:25:14 9 fm
+4 2 9 20:25:14 9 hm
+4 2 9 20:25:14 9 jm
+4 2 9 20:25:14 9 mm
+4 2 9 20:25:14 9 rm
+4 2 9 20:25:14 9 ym
+4 4 9 19:24:11 9 ck
+4 4 9 19:24:11 9 dk
+4 4 9 19:24:11 9 ek
+4 4 9 19:24:11 9 fk
+4 4 9 19:24:11 9 hk
+4 4 9 19:24:11 9 jk
+4 4 9 19:24:11 9 mk
+4 4 9 19:24:11 9 rk
+4 4 9 19:24:11 9 yk
+4 6 9 00:00:00 9 ct
+4 6 9 00:00:00 9 dt
+4 6 9 00:00:00 9 et
+4 6 9 00:00:00 9 ft
+4 6 9 00:00:00 9 ht
+4 6 9 00:00:00 9 jt
+4 6 9 00:00:00 9 mt
+4 6 9 00:00:00 9 rt
+4 6 9 00:00:00 9 yt
+4 17 9 19:56:05 9 cm
+4 17 9 19:56:05 9 dm
+4 17 9 19:56:05 9 em
+4 17 9 19:56:05 9 fm
+4 17 9 19:56:05 9 hm
+4 17 9 19:56:05 9 jm
+4 17 9 19:56:05 9 mm
+4 17 9 19:56:05 9 rm
+4 17 9 19:56:05 9 ym
+4 10 9 15:19:25 53 co
+4 10 9 15:19:25 53 do
+4 10 9 15:19:25 53 eo
+4 10 9 15:19:25 53 fo
+4 10 9 15:19:25 53 ho
+4 10 9 15:19:25 53 jo
+4 10 9 15:19:25 53 mo
+4 10 9 15:19:25 53 ro
+4 10 9 15:19:25 53 yo
+4 13 9 NULL 166 ce
+4 13 9 NULL 166 de
+4 13 9 NULL 166 ee
+4 13 9 NULL 166 fe
+4 13 9 NULL 166 he
+4 13 9 NULL 166 je
+4 13 9 NULL 166 me
+4 13 9 NULL 166 re
+4 13 9 NULL 166 ye
+drop table t1,t2;
+set @@optimizer_switch= default;
+#launchpad BUG#609045
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,NULL,NULL,'11:28:45','11:28:45','2004-10-11 18:13:16','2004-10-11 18:13:16','w','w');
+INSERT INTO `t1` VALUES (2,7,9,'2001-09-19','2001-09-19','20:25:14','20:25:14',NULL,NULL,'m','m');
+INSERT INTO `t1` VALUES (3,9,3,'2004-09-12','2004-09-12','13:47:24','13:47:24','1900-01-01 00:00:00','1900-01-01 00:00:00','m','m');
+INSERT INTO `t1` VALUES (4,7,9,NULL,NULL,'19:24:11','19:24:11','2009-07-25 00:00:00','2009-07-25 00:00:00','k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'2002-07-19','2002-07-19','15:59:13','15:59:13',NULL,NULL,'r','r');
+INSERT INTO `t1` VALUES (6,2,9,'2002-12-16','2002-12-16','00:00:00','00:00:00','2008-07-27 00:00:00','2008-07-27 00:00:00','t','t');
+INSERT INTO `t1` VALUES (7,6,3,'2006-02-08','2006-02-08','15:15:04','15:15:04','2002-11-13 16:37:31','2002-11-13 16:37:31','j','j');
+INSERT INTO `t1` VALUES (8,8,8,'2006-08-28','2006-08-28','11:32:06','11:32:06','1900-01-01 00:00:00','1900-01-01 00:00:00','u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'2001-04-14','2001-04-14','18:32:33','18:32:33','2003-12-10 00:00:00','2003-12-10 00:00:00','h','h');
+INSERT INTO `t1` VALUES (10,5,53,'2000-01-05','2000-01-05','15:19:25','15:19:25','2001-12-21 22:38:22','2001-12-21 22:38:22','o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,'2003-12-06','2003-12-06','19:03:19','19:03:19','2008-12-13 23:16:44','2008-12-13 23:16:44',NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'1900-01-01','1900-01-01','00:39:46','00:39:46','2005-08-15 12:39:41','2005-08-15 12:39:41','k','k');
+INSERT INTO `t1` VALUES (13,188,166,'2002-11-27','2002-11-27',NULL,NULL,NULL,NULL,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,NULL,NULL,'00:00:00','00:00:00','2006-09-11 12:06:14','2006-09-11 12:06:14','n','n');
+INSERT INTO `t1` VALUES (15,1,0,'2003-05-27','2003-05-27','13:12:11','13:12:11','2007-12-15 12:39:34','2007-12-15 12:39:34','t','t');
+INSERT INTO `t1` VALUES (16,1,1,'2005-05-03','2005-05-03','04:56:48','04:56:48','2005-08-09 00:00:00','2005-08-09 00:00:00','c','c');
+INSERT INTO `t1` VALUES (17,0,9,'2001-04-18','2001-04-18','19:56:05','19:56:05','2001-09-02 22:50:02','2001-09-02 22:50:02','m','m');
+INSERT INTO `t1` VALUES (18,9,5,'2005-12-27','2005-12-27','19:35:19','19:35:19','2005-12-16 22:58:11','2005-12-16 22:58:11','y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'2004-08-20','2004-08-20','05:03:03','05:03:03','2007-04-19 00:19:53','2007-04-19 00:19:53','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'1900-01-01','1900-01-01','18:38:59','18:38:59','1900-01-01 00:00:00','1900-01-01 00:00:00','d','d');
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+);
+INSERT INTO `t2` VALUES (10,7,8,NULL,NULL,'01:27:35','01:27:35','2002-02-26 06:14:37','2002-02-26 06:14:37','v','v');
+INSERT INTO `t2` VALUES (11,1,9,'2006-06-14','2006-06-14','19:48:31','19:48:31','1900-01-01 00:00:00','1900-01-01 00:00:00','r','r');
+INSERT INTO `t2` VALUES (12,5,9,'2002-09-12','2002-09-12','00:00:00','00:00:00','2006-12-03 09:37:26','2006-12-03 09:37:26','a','a');
+INSERT INTO `t2` VALUES (13,3,186,'2005-02-15','2005-02-15','19:53:05','19:53:05','2008-05-26 12:27:10','2008-05-26 12:27:10','m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,NULL,NULL,'19:18:56','19:18:56','2004-12-14 16:37:30','2004-12-14 16:37:30','y','y');
+INSERT INTO `t2` VALUES (15,92,2,'2008-11-04','2008-11-04','10:55:12','10:55:12','2003-02-11 21:19:41','2003-02-11 21:19:41','j','j');
+INSERT INTO `t2` VALUES (16,7,3,'2004-09-04','2004-09-04','00:25:00','00:25:00','2009-10-18 02:27:49','2009-10-18 02:27:49','d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'2006-06-05','2006-06-05','12:35:47','12:35:47','2000-09-26 07:45:57','2000-09-26 07:45:57','z','z');
+INSERT INTO `t2` VALUES (18,3,133,'1900-01-01','1900-01-01','19:53:03','19:53:03',NULL,NULL,'e','e');
+INSERT INTO `t2` VALUES (19,5,1,'1900-01-01','1900-01-01','17:53:30','17:53:30','2005-11-10 12:40:29','2005-11-10 12:40:29','h','h');
+INSERT INTO `t2` VALUES (20,1,8,'1900-01-01','1900-01-01','11:35:49','11:35:49','2009-04-25 00:00:00','2009-04-25 00:00:00','b','b');
+INSERT INTO `t2` VALUES (21,2,5,'2005-01-13','2005-01-13',NULL,NULL,'2002-11-27 00:00:00','2002-11-27 00:00:00','s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'2006-05-21','2006-05-21','06:01:40','06:01:40','2004-01-26 20:32:32','2004-01-26 20:32:32','e','e');
+INSERT INTO `t2` VALUES (23,1,8,'2003-09-08','2003-09-08','05:45:11','05:45:11','2007-10-26 11:41:40','2007-10-26 11:41:40','j','j');
+INSERT INTO `t2` VALUES (24,0,6,'2006-12-23','2006-12-23','00:00:00','00:00:00','2005-10-07 00:00:00','2005-10-07 00:00:00','e','e');
+INSERT INTO `t2` VALUES (25,210,51,'2006-10-15','2006-10-15','00:00:00','00:00:00','2000-07-15 05:00:34','2000-07-15 05:00:34','f','f');
+INSERT INTO `t2` VALUES (26,8,4,'2005-04-06','2005-04-06','06:11:01','06:11:01','2000-04-03 16:33:32','2000-04-03 16:33:32','v','v');
+INSERT INTO `t2` VALUES (27,7,7,'2008-04-07','2008-04-07','13:02:46','13:02:46',NULL,NULL,'x','x');
+INSERT INTO `t2` VALUES (28,5,6,'2006-10-10','2006-10-10','21:44:25','21:44:25','2001-04-25 01:26:12','2001-04-25 01:26:12','m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'1900-01-01','1900-01-01','22:43:58','22:43:58','2000-12-27 00:00:00','2000-12-27 00:00:00','c','c');
+CREATE TABLE `t3` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+);
+INSERT INTO `t3` VALUES (1,1,7,'1900-01-01','1900-01-01','01:13:38','01:13:38','2005-02-05 00:00:00','2005-02-05 00:00:00','f','f');
+CREATE TABLE `t4` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_date_key` date DEFAULT NULL,
+`col_date_nokey` date DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_time_nokey` time DEFAULT NULL,
+`col_datetime_key` datetime DEFAULT NULL,
+`col_datetime_nokey` datetime DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_date_key` (`col_date_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_datetime_key` (`col_datetime_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+);
+INSERT INTO `t4` VALUES (1,6,NULL,'2003-05-12','2003-05-12',NULL,NULL,'2000-09-12 00:00:00','2000-09-12 00:00:00','r','r');
+INSERT INTO `t4` VALUES (2,8,0,'2003-01-07','2003-01-07','14:34:45','14:34:45','2004-08-10 09:09:31','2004-08-10 09:09:31','c','c');
+INSERT INTO `t4` VALUES (3,6,0,NULL,NULL,'11:49:48','11:49:48','2005-03-21 04:31:40','2005-03-21 04:31:40','o','o');
+INSERT INTO `t4` VALUES (4,6,7,'2005-03-12','2005-03-12','18:12:55','18:12:55','2002-10-25 23:50:35','2002-10-25 23:50:35','c','c');
+INSERT INTO `t4` VALUES (5,3,8,'2000-08-02','2000-08-02','18:30:05','18:30:05','2001-04-01 21:14:04','2001-04-01 21:14:04','d','d');
+INSERT INTO `t4` VALUES (6,9,4,'1900-01-01','1900-01-01','14:19:30','14:19:30','2005-03-12 06:02:34','2005-03-12 06:02:34','v','v');
+INSERT INTO `t4` VALUES (7,2,6,'2006-07-06','2006-07-06','05:20:04','05:20:04','2001-05-06 14:49:12','2001-05-06 14:49:12','m','m');
+INSERT INTO `t4` VALUES (8,1,5,'2006-12-24','2006-12-24','20:29:31','20:29:31','2004-04-25 00:00:00','2004-04-25 00:00:00','j','j');
+INSERT INTO `t4` VALUES (9,8,NULL,'2004-11-16','2004-11-16','07:08:09','07:08:09','2001-03-22 18:38:43','2001-03-22 18:38:43','f','f');
+INSERT INTO `t4` VALUES (10,0,NULL,'2002-09-09','2002-09-09','14:49:14','14:49:14','2006-04-25 21:03:02','2006-04-25 21:03:02','n','n');
+INSERT INTO `t4` VALUES (11,9,8,NULL,NULL,'00:00:00','00:00:00','2009-09-07 18:40:43','2009-09-07 18:40:43','z','z');
+INSERT INTO `t4` VALUES (12,8,8,'2008-06-24','2008-06-24','09:58:06','09:58:06','2004-03-23 00:00:00','2004-03-23 00:00:00','h','h');
+INSERT INTO `t4` VALUES (13,NULL,8,'2001-04-21','2001-04-21',NULL,NULL,'2009-04-15 00:08:29','2009-04-15 00:08:29','q','q');
+INSERT INTO `t4` VALUES (14,0,1,'2003-11-22','2003-11-22','18:24:16','18:24:16','2000-04-21 00:00:00','2000-04-21 00:00:00','w','w');
+INSERT INTO `t4` VALUES (15,5,1,'2004-09-12','2004-09-12','17:39:57','17:39:57','2000-02-17 19:41:23','2000-02-17 19:41:23','z','z');
+INSERT INTO `t4` VALUES (16,1,5,'2006-06-20','2006-06-20','08:23:21','08:23:21','2003-09-20 07:38:14','2003-09-20 07:38:14','j','j');
+INSERT INTO `t4` VALUES (17,1,2,NULL,NULL,NULL,NULL,'2000-11-28 20:42:12','2000-11-28 20:42:12','a','a');
+INSERT INTO `t4` VALUES (18,6,7,'2001-11-25','2001-11-25','21:50:46','21:50:46','2005-06-12 11:13:17','2005-06-12 11:13:17','m','m');
+INSERT INTO `t4` VALUES (19,6,6,'2004-10-26','2004-10-26','12:33:17','12:33:17','1900-01-01 00:00:00','1900-01-01 00:00:00','n','n');
+INSERT INTO `t4` VALUES (20,1,4,'2005-01-19','2005-01-19','03:06:43','03:06:43','2006-02-09 20:41:06','2006-02-09 20:41:06','e','e');
+INSERT INTO `t4` VALUES (21,8,7,'2008-07-06','2008-07-06','03:46:14','03:46:14','2004-05-22 01:05:57','2004-05-22 01:05:57','u','u');
+INSERT INTO `t4` VALUES (22,1,0,'1900-01-01','1900-01-01','20:34:52','20:34:52','2004-03-04 13:46:31','2004-03-04 13:46:31','s','s');
+INSERT INTO `t4` VALUES (23,0,9,'1900-01-01','1900-01-01',NULL,NULL,'1900-01-01 00:00:00','1900-01-01 00:00:00','u','u');
+INSERT INTO `t4` VALUES (24,4,3,'2004-06-08','2004-06-08','10:41:20','10:41:20','2004-10-20 07:20:19','2004-10-20 07:20:19','r','r');
+INSERT INTO `t4` VALUES (25,9,5,'2007-02-20','2007-02-20','08:43:11','08:43:11','2006-04-17 00:00:00','2006-04-17 00:00:00','g','g');
+INSERT INTO `t4` VALUES (26,8,1,'2008-06-18','2008-06-18',NULL,NULL,'2000-10-27 00:00:00','2000-10-27 00:00:00','o','o');
+INSERT INTO `t4` VALUES (27,5,1,'2008-05-15','2008-05-15','10:17:51','10:17:51','2007-04-14 08:54:06','2007-04-14 08:54:06','w','w');
+INSERT INTO `t4` VALUES (28,9,5,'2005-10-06','2005-10-06','06:34:09','06:34:09','2008-04-12 17:03:52','2008-04-12 17:03:52','b','b');
+INSERT INTO `t4` VALUES (29,5,9,NULL,NULL,'21:22:47','21:22:47','2007-02-19 17:37:09','2007-02-19 17:37:09',NULL,NULL);
+INSERT INTO `t4` VALUES (30,NULL,2,'2006-10-12','2006-10-12','04:02:32','04:02:32','1900-01-01 00:00:00','1900-01-01 00:00:00','y','y');
+INSERT INTO `t4` VALUES (31,NULL,5,'2005-01-24','2005-01-24','02:33:14','02:33:14','2001-10-10 08:32:27','2001-10-10 08:32:27','y','y');
+INSERT INTO `t4` VALUES (32,105,248,'2009-06-27','2009-06-27','16:32:56','16:32:56',NULL,NULL,'u','u');
+INSERT INTO `t4` VALUES (33,0,0,NULL,NULL,'21:32:42','21:32:42','2001-12-16 05:31:53','2001-12-16 05:31:53','p','p');
+INSERT INTO `t4` VALUES (34,3,8,NULL,NULL,'23:04:47','23:04:47','2003-07-19 18:03:28','2003-07-19 18:03:28','s','s');
+INSERT INTO `t4` VALUES (35,1,1,'1900-01-01','1900-01-01','22:05:43','22:05:43','2001-03-27 11:44:10','2001-03-27 11:44:10','e','e');
+INSERT INTO `t4` VALUES (36,75,255,'2005-12-22','2005-12-22','02:05:45','02:05:45','2008-06-15 02:13:00','2008-06-15 02:13:00','d','d');
+INSERT INTO `t4` VALUES (37,9,9,'2005-05-03','2005-05-03','00:00:00','00:00:00','2009-03-14 21:29:56','2009-03-14 21:29:56','d','d');
+INSERT INTO `t4` VALUES (38,7,9,'2003-05-27','2003-05-27','18:09:07','18:09:07','2005-01-02 00:00:00','2005-01-02 00:00:00','c','c');
+INSERT INTO `t4` VALUES (39,NULL,3,'2006-05-25','2006-05-25','10:54:06','10:54:06','2007-07-16 04:44:07','2007-07-16 04:44:07','b','b');
+INSERT INTO `t4` VALUES (40,NULL,9,NULL,NULL,'23:15:50','23:15:50','2003-08-26 21:38:26','2003-08-26 21:38:26','t','t');
+INSERT INTO `t4` VALUES (41,4,6,'2009-01-04','2009-01-04','10:17:40','10:17:40','2004-04-19 04:18:47','2004-04-19 04:18:47',NULL,NULL);
+INSERT INTO `t4` VALUES (42,0,4,'2009-02-14','2009-02-14','03:37:09','03:37:09','2000-01-06 20:32:48','2000-01-06 20:32:48','y','y');
+INSERT INTO `t4` VALUES (43,204,60,'2003-01-16','2003-01-16','22:26:06','22:26:06','2006-06-23 13:27:17','2006-06-23 13:27:17','c','c');
+INSERT INTO `t4` VALUES (44,0,7,'1900-01-01','1900-01-01','17:10:38','17:10:38','2007-11-27 00:00:00','2007-11-27 00:00:00','d','d');
+INSERT INTO `t4` VALUES (45,9,1,'2007-06-26','2007-06-26','00:00:00','00:00:00','2002-04-03 12:06:51','2002-04-03 12:06:51','x','x');
+INSERT INTO `t4` VALUES (46,8,6,'2004-03-27','2004-03-27','17:08:49','17:08:49','2008-12-28 09:47:42','2008-12-28 09:47:42','p','p');
+INSERT INTO `t4` VALUES (47,7,4,NULL,NULL,'19:04:40','19:04:40','2002-04-04 10:07:54','2002-04-04 10:07:54','e','e');
+INSERT INTO `t4` VALUES (48,8,NULL,'2005-06-06','2005-06-06','20:53:28','20:53:28','2003-04-26 02:55:13','2003-04-26 02:55:13','g','g');
+INSERT INTO `t4` VALUES (49,NULL,8,'2003-03-02','2003-03-02','11:46:03','11:46:03',NULL,NULL,'x','x');
+INSERT INTO `t4` VALUES (50,6,0,'2004-05-13','2004-05-13',NULL,NULL,'2009-02-19 03:17:06','2009-02-19 03:17:06','s','s');
+INSERT INTO `t4` VALUES (51,5,8,'2005-09-13','2005-09-13','10:58:07','10:58:07','1900-01-01 00:00:00','1900-01-01 00:00:00','e','e');
+INSERT INTO `t4` VALUES (52,2,151,'2005-10-03','2005-10-03','00:00:00','00:00:00','2000-11-10 08:20:01','2000-11-10 08:20:01','l','l');
+INSERT INTO `t4` VALUES (53,3,7,'2005-10-14','2005-10-14','09:43:15','09:43:15','2008-02-10 00:00:00','2008-02-10 00:00:00','p','p');
+INSERT INTO `t4` VALUES (54,7,6,NULL,NULL,'21:40:32','21:40:32','1900-01-01 00:00:00','1900-01-01 00:00:00','h','h');
+INSERT INTO `t4` VALUES (55,NULL,NULL,'2005-09-16','2005-09-16','00:17:44','00:17:44',NULL,NULL,'m','m');
+INSERT INTO `t4` VALUES (56,145,23,'2005-03-10','2005-03-10','16:47:26','16:47:26','2001-02-05 02:01:50','2001-02-05 02:01:50','n','n');
+INSERT INTO `t4` VALUES (57,0,2,'2000-06-19','2000-06-19','00:00:00','00:00:00','2000-10-28 08:44:25','2000-10-28 08:44:25','v','v');
+INSERT INTO `t4` VALUES (58,1,4,'2002-11-03','2002-11-03','05:25:59','05:25:59','2005-03-20 10:53:59','2005-03-20 10:53:59','b','b');
+INSERT INTO `t4` VALUES (59,7,NULL,'2009-01-05','2009-01-05','00:00:00','00:00:00','2001-06-02 13:54:13','2001-06-02 13:54:13','x','x');
+INSERT INTO `t4` VALUES (60,3,NULL,'2003-05-22','2003-05-22','20:33:04','20:33:04','1900-01-01 00:00:00','1900-01-01 00:00:00','r','r');
+INSERT INTO `t4` VALUES (61,NULL,77,'2005-07-02','2005-07-02','00:46:12','00:46:12','2009-07-16 13:05:43','2009-07-16 13:05:43','t','t');
+INSERT INTO `t4` VALUES (62,2,NULL,'1900-01-01','1900-01-01','00:00:00','00:00:00','2009-03-26 23:16:20','2009-03-26 23:16:20','w','w');
+INSERT INTO `t4` VALUES (63,2,NULL,'2006-06-21','2006-06-21','02:13:59','02:13:59','2003-02-06 18:12:15','2003-02-06 18:12:15','w','w');
+INSERT INTO `t4` VALUES (64,2,7,NULL,NULL,'02:54:47','02:54:47','2006-06-05 03:22:51','2006-06-05 03:22:51','k','k');
+INSERT INTO `t4` VALUES (65,8,1,'2005-12-16','2005-12-16','18:13:59','18:13:59','2002-02-10 05:47:27','2002-02-10 05:47:27','a','a');
+INSERT INTO `t4` VALUES (66,6,9,'2004-11-05','2004-11-05','13:53:08','13:53:08','2001-08-01 08:50:52','2001-08-01 08:50:52','t','t');
+INSERT INTO `t4` VALUES (67,1,6,NULL,NULL,'22:21:30','22:21:30','1900-01-01 00:00:00','1900-01-01 00:00:00','z','z');
+INSERT INTO `t4` VALUES (68,NULL,2,'2004-09-14','2004-09-14','11:41:50','11:41:50',NULL,NULL,'e','e');
+INSERT INTO `t4` VALUES (69,1,3,'2002-04-06','2002-04-06','15:20:02','15:20:02','1900-01-01 00:00:00','1900-01-01 00:00:00','q','q');
+INSERT INTO `t4` VALUES (70,0,0,NULL,NULL,NULL,NULL,'2000-09-23 00:00:00','2000-09-23 00:00:00','e','e');
+INSERT INTO `t4` VALUES (71,4,NULL,'2002-11-13','2002-11-13',NULL,NULL,'2007-07-09 08:32:49','2007-07-09 08:32:49','v','v');
+INSERT INTO `t4` VALUES (72,1,6,'2006-05-27','2006-05-27','07:51:52','07:51:52','2000-01-05 00:00:00','2000-01-05 00:00:00','d','d');
+INSERT INTO `t4` VALUES (73,1,3,'2000-12-22','2000-12-22','00:00:00','00:00:00','2000-09-24 00:00:00','2000-09-24 00:00:00','u','u');
+INSERT INTO `t4` VALUES (74,27,195,'2004-02-21','2004-02-21',NULL,NULL,'2005-05-06 00:00:00','2005-05-06 00:00:00','o','o');
+INSERT INTO `t4` VALUES (75,4,5,'2009-05-15','2009-05-15',NULL,NULL,'2000-03-11 00:00:00','2000-03-11 00:00:00','b','b');
+INSERT INTO `t4` VALUES (76,6,2,'2008-12-12','2008-12-12','12:31:05','12:31:05','2001-09-02 16:17:35','2001-09-02 16:17:35','c','c');
+INSERT INTO `t4` VALUES (77,2,7,'2000-04-15','2000-04-15','00:00:00','00:00:00','2006-04-25 05:43:44','2006-04-25 05:43:44','q','q');
+INSERT INTO `t4` VALUES (78,248,25,NULL,NULL,'01:16:45','01:16:45','2009-10-25 22:04:02','2009-10-25 22:04:02',NULL,NULL);
+INSERT INTO `t4` VALUES (79,NULL,NULL,'2001-10-18','2001-10-18','20:38:54','20:38:54','2004-08-06 00:00:00','2004-08-06 00:00:00','h','h');
+INSERT INTO `t4` VALUES (80,9,0,'2008-05-25','2008-05-25','00:30:15','00:30:15','2001-11-27 05:07:57','2001-11-27 05:07:57','d','d');
+INSERT INTO `t4` VALUES (81,75,98,'2004-12-02','2004-12-02','23:46:36','23:46:36','2009-06-28 03:18:39','2009-06-28 03:18:39','w','w');
+INSERT INTO `t4` VALUES (82,2,6,'2002-02-15','2002-02-15','19:03:13','19:03:13','2000-03-12 00:00:00','2000-03-12 00:00:00','m','m');
+INSERT INTO `t4` VALUES (83,9,5,'2002-03-03','2002-03-03','10:54:27','10:54:27',NULL,NULL,'i','i');
+INSERT INTO `t4` VALUES (84,4,0,NULL,NULL,'00:25:47','00:25:47','2007-10-20 00:00:00','2007-10-20 00:00:00','w','w');
+INSERT INTO `t4` VALUES (85,0,3,'2003-01-26','2003-01-26','08:44:27','08:44:27','2009-09-27 00:00:00','2009-09-27 00:00:00','f','f');
+INSERT INTO `t4` VALUES (86,0,1,'2001-12-19','2001-12-19','08:15:38','08:15:38','2002-07-16 00:00:00','2002-07-16 00:00:00','k','k');
+INSERT INTO `t4` VALUES (87,1,1,'2001-08-07','2001-08-07','19:56:21','19:56:21','2005-02-20 00:00:00','2005-02-20 00:00:00','v','v');
+INSERT INTO `t4` VALUES (88,119,147,'2005-02-16','2005-02-16','00:00:00','00:00:00',NULL,NULL,'c','c');
+INSERT INTO `t4` VALUES (89,1,3,'2006-06-10','2006-06-10','20:50:52','20:50:52','2001-07-16 00:00:00','2001-07-16 00:00:00','y','y');
+INSERT INTO `t4` VALUES (90,7,3,NULL,NULL,'03:54:39','03:54:39','2009-05-20 21:04:12','2009-05-20 21:04:12','h','h');
+INSERT INTO `t4` VALUES (91,2,NULL,'2005-04-06','2005-04-06','23:58:17','23:58:17','2002-03-13 10:55:40','2002-03-13 10:55:40',NULL,NULL);
+INSERT INTO `t4` VALUES (92,7,2,'2003-04-27','2003-04-27','12:54:58','12:54:58','2005-07-12 00:00:00','2005-07-12 00:00:00','t','t');
+INSERT INTO `t4` VALUES (93,2,1,'2005-10-13','2005-10-13','04:02:43','04:02:43','2006-07-22 09:46:34','2006-07-22 09:46:34','l','l');
+INSERT INTO `t4` VALUES (94,6,8,'2003-10-02','2003-10-02','11:31:12','11:31:12','2001-09-01 00:00:00','2001-09-01 00:00:00','a','a');
+INSERT INTO `t4` VALUES (95,4,8,'2005-09-09','2005-09-09','20:20:04','20:20:04','2002-05-27 18:38:45','2002-05-27 18:38:45','r','r');
+INSERT INTO `t4` VALUES (96,5,8,NULL,NULL,'00:22:24','00:22:24',NULL,NULL,'s','s');
+INSERT INTO `t4` VALUES (97,7,0,'2006-02-15','2006-02-15','10:09:31','10:09:31',NULL,NULL,'z','z');
+INSERT INTO `t4` VALUES (98,1,1,'1900-01-01','1900-01-01',NULL,NULL,'2009-08-08 22:38:53','2009-08-08 22:38:53','j','j');
+INSERT INTO `t4` VALUES (99,7,8,'2003-12-24','2003-12-24','18:45:35','18:45:35',NULL,NULL,'c','c');
+INSERT INTO `t4` VALUES (100,2,5,'2001-07-26','2001-07-26','11:49:25','11:49:25','2007-04-25 05:08:49','2007-04-25 05:08:49','f','f');
+SET @@optimizer_switch='subquery_cache=off';
+/* cache is off */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+COUNT( DISTINCT table2 .`col_int_key` ) (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) field10
+1 NULL c
+1 NULL d
+1 NULL e
+1 NULL f
+1 NULL h
+1 NULL j
+2 NULL k
+2 NULL m
+1 NULL n
+1 NULL o
+0 NULL r
+2 NULL t
+1 NULL u
+1 NULL w
+1 NULL y
+SET @@optimizer_switch='subquery_cache=on';
+/* cache is on */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+COUNT( DISTINCT table2 .`col_int_key` ) (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) field10
+1 NULL c
+1 NULL d
+1 NULL e
+1 NULL f
+1 NULL h
+1 NULL j
+2 NULL k
+2 NULL m
+1 NULL n
+1 NULL o
+0 NULL r
+2 NULL t
+1 NULL u
+1 NULL w
+1 NULL y
+drop table t1,t2,t3,t4;
+set @@optimizer_switch= default;
+#launchpad BUG#609045
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,7,8,'v','v');
+INSERT INTO `t2` VALUES (11,1,9,'r','r');
+INSERT INTO `t2` VALUES (12,5,9,'a','a');
+INSERT INTO `t2` VALUES (13,3,186,'m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,'y','y');
+INSERT INTO `t2` VALUES (15,92,2,'j','j');
+INSERT INTO `t2` VALUES (16,7,3,'d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'z','z');
+INSERT INTO `t2` VALUES (18,3,133,'e','e');
+INSERT INTO `t2` VALUES (19,5,1,'h','h');
+INSERT INTO `t2` VALUES (20,1,8,'b','b');
+INSERT INTO `t2` VALUES (21,2,5,'s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'e','e');
+INSERT INTO `t2` VALUES (23,1,8,'j','j');
+INSERT INTO `t2` VALUES (24,0,6,'e','e');
+INSERT INTO `t2` VALUES (25,210,51,'f','f');
+INSERT INTO `t2` VALUES (26,8,4,'v','v');
+INSERT INTO `t2` VALUES (27,7,7,'x','x');
+INSERT INTO `t2` VALUES (28,5,6,'m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'c','c');
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,'w','w');
+INSERT INTO `t1` VALUES (2,7,9,'m','m');
+INSERT INTO `t1` VALUES (3,9,3,'m','m');
+INSERT INTO `t1` VALUES (4,7,9,'k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'r','r');
+INSERT INTO `t1` VALUES (6,2,9,'t','t');
+INSERT INTO `t1` VALUES (7,6,3,'j','j');
+INSERT INTO `t1` VALUES (8,8,8,'u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'h','h');
+INSERT INTO `t1` VALUES (10,5,53,'o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'k','k');
+INSERT INTO `t1` VALUES (13,188,166,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,'n','n');
+INSERT INTO `t1` VALUES (15,1,0,'t','t');
+INSERT INTO `t1` VALUES (16,1,1,'c','c');
+INSERT INTO `t1` VALUES (17,0,9,'m','m');
+INSERT INTO `t1` VALUES (18,9,5,'y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'f','f');
+INSERT INTO `t1` VALUES (20,4,2,'d','d');
+SET @@optimizer_switch = 'subquery_cache=off';
+/* cache is off */ SELECT SUM( DISTINCT table1 .`pk` ) , (
+SELECT MAX( `col_int_nokey` )
+FROM t1
+WHERE table1 .`pk` ) field3
+FROM t1 table1
+JOIN (
+t1 table2
+JOIN t2 table3
+ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+)
+ON table3 .`col_varchar_key` = table2 .`col_varchar_nokey`
+GROUP BY field3 ;
+SUM( DISTINCT table1 .`pk` ) field3
+210 188
+SET @@optimizer_switch = 'subquery_cache=on';
+/* cache is on */ SELECT SUM( DISTINCT table1 .`pk` ) , (
+SELECT MAX( `col_int_nokey` )
+FROM t1
+WHERE table1 .`pk` ) field3
+FROM t1 table1
+JOIN (
+t1 table2
+JOIN t2 table3
+ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+)
+ON table3 .`col_varchar_key` = table2 .`col_varchar_nokey`
+GROUP BY field3 ;
+SUM( DISTINCT table1 .`pk` ) field3
+210 188
+drop table t1,t2;
+set @@optimizer_switch= default;
+#launchpad BUG#609052
+CREATE TABLE `t2` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,7,8,'01:27:35','v','v');
+INSERT INTO `t2` VALUES (11,1,9,'19:48:31','r','r');
+INSERT INTO `t2` VALUES (12,5,9,'00:00:00','a','a');
+INSERT INTO `t2` VALUES (13,3,186,'19:53:05','m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,'19:18:56','y','y');
+INSERT INTO `t2` VALUES (15,92,2,'10:55:12','j','j');
+INSERT INTO `t2` VALUES (16,7,3,'00:25:00','d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'12:35:47','z','z');
+INSERT INTO `t2` VALUES (18,3,133,'19:53:03','e','e');
+INSERT INTO `t2` VALUES (19,5,1,'17:53:30','h','h');
+INSERT INTO `t2` VALUES (20,1,8,'11:35:49','b','b');
+INSERT INTO `t2` VALUES (21,2,5,NULL,'s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'06:01:40','e','e');
+INSERT INTO `t2` VALUES (23,1,8,'05:45:11','j','j');
+INSERT INTO `t2` VALUES (24,0,6,'00:00:00','e','e');
+INSERT INTO `t2` VALUES (25,210,51,'00:00:00','f','f');
+INSERT INTO `t2` VALUES (26,8,4,'06:11:01','v','v');
+INSERT INTO `t2` VALUES (27,7,7,'13:02:46','x','x');
+INSERT INTO `t2` VALUES (28,5,6,'21:44:25','m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'22:43:58','c','c');
+CREATE TABLE `t4` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t4` VALUES (1,6,NULL,NULL,'r','r');
+INSERT INTO `t4` VALUES (2,8,0,'14:34:45','c','c');
+INSERT INTO `t4` VALUES (3,6,0,'11:49:48','o','o');
+INSERT INTO `t4` VALUES (4,6,7,'18:12:55','c','c');
+INSERT INTO `t4` VALUES (5,3,8,'18:30:05','d','d');
+INSERT INTO `t4` VALUES (6,9,4,'14:19:30','v','v');
+INSERT INTO `t4` VALUES (7,2,6,'05:20:04','m','m');
+INSERT INTO `t4` VALUES (8,1,5,'20:29:31','j','j');
+INSERT INTO `t4` VALUES (9,8,NULL,'07:08:09','f','f');
+INSERT INTO `t4` VALUES (10,0,NULL,'14:49:14','n','n');
+INSERT INTO `t4` VALUES (11,9,8,'00:00:00','z','z');
+INSERT INTO `t4` VALUES (12,8,8,'09:58:06','h','h');
+INSERT INTO `t4` VALUES (13,NULL,8,NULL,'q','q');
+INSERT INTO `t4` VALUES (14,0,1,'18:24:16','w','w');
+INSERT INTO `t4` VALUES (15,5,1,'17:39:57','z','z');
+INSERT INTO `t4` VALUES (16,1,5,'08:23:21','j','j');
+INSERT INTO `t4` VALUES (17,1,2,NULL,'a','a');
+INSERT INTO `t4` VALUES (18,6,7,'21:50:46','m','m');
+INSERT INTO `t4` VALUES (19,6,6,'12:33:17','n','n');
+INSERT INTO `t4` VALUES (20,1,4,'03:06:43','e','e');
+INSERT INTO `t4` VALUES (21,8,7,'03:46:14','u','u');
+INSERT INTO `t4` VALUES (22,1,0,'20:34:52','s','s');
+INSERT INTO `t4` VALUES (23,0,9,NULL,'u','u');
+INSERT INTO `t4` VALUES (24,4,3,'10:41:20','r','r');
+INSERT INTO `t4` VALUES (25,9,5,'08:43:11','g','g');
+INSERT INTO `t4` VALUES (26,8,1,NULL,'o','o');
+INSERT INTO `t4` VALUES (27,5,1,'10:17:51','w','w');
+INSERT INTO `t4` VALUES (28,9,5,'06:34:09','b','b');
+INSERT INTO `t4` VALUES (29,5,9,'21:22:47',NULL,NULL);
+INSERT INTO `t4` VALUES (30,NULL,2,'04:02:32','y','y');
+INSERT INTO `t4` VALUES (31,NULL,5,'02:33:14','y','y');
+INSERT INTO `t4` VALUES (32,105,248,'16:32:56','u','u');
+INSERT INTO `t4` VALUES (33,0,0,'21:32:42','p','p');
+INSERT INTO `t4` VALUES (34,3,8,'23:04:47','s','s');
+INSERT INTO `t4` VALUES (35,1,1,'22:05:43','e','e');
+INSERT INTO `t4` VALUES (36,75,255,'02:05:45','d','d');
+INSERT INTO `t4` VALUES (37,9,9,'00:00:00','d','d');
+INSERT INTO `t4` VALUES (38,7,9,'18:09:07','c','c');
+INSERT INTO `t4` VALUES (39,NULL,3,'10:54:06','b','b');
+INSERT INTO `t4` VALUES (40,NULL,9,'23:15:50','t','t');
+INSERT INTO `t4` VALUES (41,4,6,'10:17:40',NULL,NULL);
+INSERT INTO `t4` VALUES (42,0,4,'03:37:09','y','y');
+INSERT INTO `t4` VALUES (43,204,60,'22:26:06','c','c');
+INSERT INTO `t4` VALUES (44,0,7,'17:10:38','d','d');
+INSERT INTO `t4` VALUES (45,9,1,'00:00:00','x','x');
+INSERT INTO `t4` VALUES (46,8,6,'17:08:49','p','p');
+INSERT INTO `t4` VALUES (47,7,4,'19:04:40','e','e');
+INSERT INTO `t4` VALUES (48,8,NULL,'20:53:28','g','g');
+INSERT INTO `t4` VALUES (49,NULL,8,'11:46:03','x','x');
+INSERT INTO `t4` VALUES (50,6,0,NULL,'s','s');
+INSERT INTO `t4` VALUES (51,5,8,'10:58:07','e','e');
+INSERT INTO `t4` VALUES (52,2,151,'00:00:00','l','l');
+INSERT INTO `t4` VALUES (53,3,7,'09:43:15','p','p');
+INSERT INTO `t4` VALUES (54,7,6,'21:40:32','h','h');
+INSERT INTO `t4` VALUES (55,NULL,NULL,'00:17:44','m','m');
+INSERT INTO `t4` VALUES (56,145,23,'16:47:26','n','n');
+INSERT INTO `t4` VALUES (57,0,2,'00:00:00','v','v');
+INSERT INTO `t4` VALUES (58,1,4,'05:25:59','b','b');
+INSERT INTO `t4` VALUES (59,7,NULL,'00:00:00','x','x');
+INSERT INTO `t4` VALUES (60,3,NULL,'20:33:04','r','r');
+INSERT INTO `t4` VALUES (61,NULL,77,'00:46:12','t','t');
+INSERT INTO `t4` VALUES (62,2,NULL,'00:00:00','w','w');
+INSERT INTO `t4` VALUES (63,2,NULL,'02:13:59','w','w');
+INSERT INTO `t4` VALUES (64,2,7,'02:54:47','k','k');
+INSERT INTO `t4` VALUES (65,8,1,'18:13:59','a','a');
+INSERT INTO `t4` VALUES (66,6,9,'13:53:08','t','t');
+INSERT INTO `t4` VALUES (67,1,6,'22:21:30','z','z');
+INSERT INTO `t4` VALUES (68,NULL,2,'11:41:50','e','e');
+INSERT INTO `t4` VALUES (69,1,3,'15:20:02','q','q');
+INSERT INTO `t4` VALUES (70,0,0,NULL,'e','e');
+INSERT INTO `t4` VALUES (71,4,NULL,NULL,'v','v');
+INSERT INTO `t4` VALUES (72,1,6,'07:51:52','d','d');
+INSERT INTO `t4` VALUES (73,1,3,'00:00:00','u','u');
+INSERT INTO `t4` VALUES (74,27,195,NULL,'o','o');
+INSERT INTO `t4` VALUES (75,4,5,NULL,'b','b');
+INSERT INTO `t4` VALUES (76,6,2,'12:31:05','c','c');
+INSERT INTO `t4` VALUES (77,2,7,'00:00:00','q','q');
+INSERT INTO `t4` VALUES (78,248,25,'01:16:45',NULL,NULL);
+INSERT INTO `t4` VALUES (79,NULL,NULL,'20:38:54','h','h');
+INSERT INTO `t4` VALUES (80,9,0,'00:30:15','d','d');
+INSERT INTO `t4` VALUES (81,75,98,'23:46:36','w','w');
+INSERT INTO `t4` VALUES (82,2,6,'19:03:13','m','m');
+INSERT INTO `t4` VALUES (83,9,5,'10:54:27','i','i');
+INSERT INTO `t4` VALUES (84,4,0,'00:25:47','w','w');
+INSERT INTO `t4` VALUES (85,0,3,'08:44:27','f','f');
+INSERT INTO `t4` VALUES (86,0,1,'08:15:38','k','k');
+INSERT INTO `t4` VALUES (87,1,1,'19:56:21','v','v');
+INSERT INTO `t4` VALUES (88,119,147,'00:00:00','c','c');
+INSERT INTO `t4` VALUES (89,1,3,'20:50:52','y','y');
+INSERT INTO `t4` VALUES (90,7,3,'03:54:39','h','h');
+INSERT INTO `t4` VALUES (91,2,NULL,'23:58:17',NULL,NULL);
+INSERT INTO `t4` VALUES (92,7,2,'12:54:58','t','t');
+INSERT INTO `t4` VALUES (93,2,1,'04:02:43','l','l');
+INSERT INTO `t4` VALUES (94,6,8,'11:31:12','a','a');
+INSERT INTO `t4` VALUES (95,4,8,'20:20:04','r','r');
+INSERT INTO `t4` VALUES (96,5,8,'00:22:24','s','s');
+INSERT INTO `t4` VALUES (97,7,0,'10:09:31','z','z');
+INSERT INTO `t4` VALUES (98,1,1,NULL,'j','j');
+INSERT INTO `t4` VALUES (99,7,8,'18:45:35','c','c');
+INSERT INTO `t4` VALUES (100,2,5,'11:49:25','f','f');
+CREATE TABLE `t1` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,'11:28:45','w','w');
+INSERT INTO `t1` VALUES (2,7,9,'20:25:14','m','m');
+INSERT INTO `t1` VALUES (3,9,3,'13:47:24','m','m');
+INSERT INTO `t1` VALUES (4,7,9,'19:24:11','k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'15:59:13','r','r');
+INSERT INTO `t1` VALUES (6,2,9,'00:00:00','t','t');
+INSERT INTO `t1` VALUES (7,6,3,'15:15:04','j','j');
+INSERT INTO `t1` VALUES (8,8,8,'11:32:06','u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'18:32:33','h','h');
+INSERT INTO `t1` VALUES (10,5,53,'15:19:25','o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,'19:03:19',NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'00:39:46','k','k');
+INSERT INTO `t1` VALUES (13,188,166,NULL,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,'00:00:00','n','n');
+INSERT INTO `t1` VALUES (15,1,0,'13:12:11','t','t');
+INSERT INTO `t1` VALUES (16,1,1,'04:56:48','c','c');
+INSERT INTO `t1` VALUES (17,0,9,'19:56:05','m','m');
+INSERT INTO `t1` VALUES (18,9,5,'19:35:19','y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'05:03:03','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'18:38:59','d','d');
+CREATE TABLE `t3` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
+INSERT INTO `t3` VALUES (10,8,8,'18:27:58',NULL,NULL);
+CREATE TABLE `t5` (
+`pk` int(11) NOT NULL AUTO_INCREMENT,
+`col_int_nokey` int(11) DEFAULT NULL,
+`col_int_key` int(11) DEFAULT NULL,
+`col_time_key` time DEFAULT NULL,
+`col_varchar_key` varchar(1) DEFAULT NULL,
+`col_varchar_nokey` varchar(1) DEFAULT NULL,
+PRIMARY KEY (`pk`),
+KEY `col_int_key` (`col_int_key`),
+KEY `col_time_key` (`col_time_key`),
+KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
+INSERT INTO `t5` VALUES (1,1,7,'01:13:38','f','f');
+SET @@optimizer_switch='subquery_cache=off';
+/* cache is off */ SELECT SQL_SMALL_RESULT MAX( DISTINCT table1 . `col_varchar_key` ) AS field1 , MIN( table1 . `col_varchar_nokey` ) AS field2 , COUNT( table1 . `col_varchar_key` ) AS field3 , table2 . `col_time_key` AS field4 , COUNT( DISTINCT table2 . `col_int_key` ) AS field5 , (
+SELECT MAX( SUBQUERY1_t2 . `col_int_nokey` ) AS SUBQUERY1_field1
+FROM ( t3 AS SUBQUERY1_t1 INNER JOIN t1 AS SUBQUERY1_t2 ON (SUBQUERY1_t2 . `col_varchar_key` = SUBQUERY1_t1 . `col_varchar_nokey` ) )
+WHERE SUBQUERY1_t2 . `pk` < SUBQUERY1_t2 . `pk` ) AS field6 , COUNT( table1 . `col_varchar_nokey` ) AS field7 , COUNT( table2 . `pk` ) AS field8 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_key` ) AS SUBQUERY2_field1
+FROM ( t5 AS SUBQUERY2_t1 LEFT JOIN t2 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `col_int_key` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` != table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_varchar_nokey` >= 'o' ) AS field9 , CONCAT ( table1 . `col_varchar_key` , table2 . `col_varchar_nokey` ) AS field10
+FROM ( t4 AS table1 LEFT JOIN ( ( t1 AS table2 STRAIGHT_JOIN t1 AS table3 ON (table3 . `col_int_nokey` = table2 . `pk` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) )
+WHERE ( EXISTS (
+SELECT SUBQUERY3_t1 . `pk` AS SUBQUERY3_field1
+FROM ( t4 AS SUBQUERY3_t1 INNER JOIN t4 AS SUBQUERY3_t2 ON (SUBQUERY3_t2 . `col_varchar_key` = SUBQUERY3_t1 . `col_varchar_key` ) )
+WHERE SUBQUERY3_t1 . `col_int_key` > table3 . `pk` AND SUBQUERY3_t1 . `pk` != table3 . `pk` ) ) AND ( table1 . `pk` > 116 AND table1 . `pk` < ( 116 + 175 ) OR table1 . `pk` IN (251) ) OR table1 . `col_int_nokey` = table1 . `col_int_nokey`
+GROUP BY field4, field6, field9, field10
+HAVING field10 = 'c'
+;
+field1 field2 field3 field4 field5 field6 field7 field8 field9 field10
+SET @@optimizer_switch='subquery_cache=on';
+/* cache is on */ SELECT SQL_SMALL_RESULT MAX( DISTINCT table1 . `col_varchar_key` ) AS field1 , MIN( table1 . `col_varchar_nokey` ) AS field2 , COUNT( table1 . `col_varchar_key` ) AS field3 , table2 . `col_time_key` AS field4 , COUNT( DISTINCT table2 . `col_int_key` ) AS field5 , (
+SELECT MAX( SUBQUERY1_t2 . `col_int_nokey` ) AS SUBQUERY1_field1
+FROM ( t3 AS SUBQUERY1_t1 INNER JOIN t1 AS SUBQUERY1_t2 ON (SUBQUERY1_t2 . `col_varchar_key` = SUBQUERY1_t1 . `col_varchar_nokey` ) )
+WHERE SUBQUERY1_t2 . `pk` < SUBQUERY1_t2 . `pk` ) AS field6 , COUNT( table1 . `col_varchar_nokey` ) AS field7 , COUNT( table2 . `pk` ) AS field8 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_key` ) AS SUBQUERY2_field1
+FROM ( t5 AS SUBQUERY2_t1 LEFT JOIN t2 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `col_int_key` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` != table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_varchar_nokey` >= 'o' ) AS field9 , CONCAT ( table1 . `col_varchar_key` , table2 . `col_varchar_nokey` ) AS field10
+FROM ( t4 AS table1 LEFT JOIN ( ( t1 AS table2 STRAIGHT_JOIN t1 AS table3 ON (table3 . `col_int_nokey` = table2 . `pk` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) )
+WHERE ( EXISTS (
+SELECT SUBQUERY3_t1 . `pk` AS SUBQUERY3_field1
+FROM ( t4 AS SUBQUERY3_t1 INNER JOIN t4 AS SUBQUERY3_t2 ON (SUBQUERY3_t2 . `col_varchar_key` = SUBQUERY3_t1 . `col_varchar_key` ) )
+WHERE SUBQUERY3_t1 . `col_int_key` > table3 . `pk` AND SUBQUERY3_t1 . `pk` != table3 . `pk` ) ) AND ( table1 . `pk` > 116 AND table1 . `pk` < ( 116 + 175 ) OR table1 . `pk` IN (251) ) OR table1 . `col_int_nokey` = table1 . `col_int_nokey`
+GROUP BY field4, field6, field9, field10
+HAVING field10 = 'c'
+;
+field1 field2 field3 field4 field5 field6 field7 field8 field9 field10
+drop table t1,t2,t3,t4,t5;
+set @@optimizer_switch= default;
=== modified file 'mysql-test/t/subquery_cache.test'
--- a/mysql-test/t/subquery_cache.test 2010-07-10 10:37:30 +0000
+++ b/mysql-test/t/subquery_cache.test 2010-07-29 11:13:48 +0000
@@ -507,3 +507,698 @@
drop table t0,t1,t2;
set optimizer_switch='default';
+
+#
+--echo #launchpad BUG#608834
+#
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,7,8,'01:27:35','v','v');
+INSERT INTO `t2` VALUES (11,1,9,'19:48:31','r','r');
+INSERT INTO `t2` VALUES (12,5,9,'00:00:00','a','a');
+INSERT INTO `t2` VALUES (13,3,186,'19:53:05','m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,'19:18:56','y','y');
+INSERT INTO `t2` VALUES (15,92,2,'10:55:12','j','j');
+INSERT INTO `t2` VALUES (16,7,3,'00:25:00','d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'12:35:47','z','z');
+INSERT INTO `t2` VALUES (18,3,133,'19:53:03','e','e');
+INSERT INTO `t2` VALUES (19,5,1,'17:53:30','h','h');
+INSERT INTO `t2` VALUES (20,1,8,'11:35:49','b','b');
+INSERT INTO `t2` VALUES (21,2,5,NULL,'s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'06:01:40','e','e');
+INSERT INTO `t2` VALUES (23,1,8,'05:45:11','j','j');
+INSERT INTO `t2` VALUES (24,0,6,'00:00:00','e','e');
+INSERT INTO `t2` VALUES (25,210,51,'00:00:00','f','f');
+INSERT INTO `t2` VALUES (26,8,4,'06:11:01','v','v');
+INSERT INTO `t2` VALUES (27,7,7,'13:02:46','x','x');
+INSERT INTO `t2` VALUES (28,5,6,'21:44:25','m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'22:43:58','c','c');
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,'11:28:45','w','w');
+INSERT INTO `t1` VALUES (2,7,9,'20:25:14','m','m');
+INSERT INTO `t1` VALUES (3,9,3,'13:47:24','m','m');
+INSERT INTO `t1` VALUES (4,7,9,'19:24:11','k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'15:59:13','r','r');
+INSERT INTO `t1` VALUES (6,2,9,'00:00:00','t','t');
+INSERT INTO `t1` VALUES (7,6,3,'15:15:04','j','j');
+INSERT INTO `t1` VALUES (8,8,8,'11:32:06','u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'18:32:33','h','h');
+INSERT INTO `t1` VALUES (10,5,53,'15:19:25','o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,'19:03:19',NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'00:39:46','k','k');
+INSERT INTO `t1` VALUES (13,188,166,NULL,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,'00:00:00','n','n');
+INSERT INTO `t1` VALUES (15,1,0,'13:12:11','t','t');
+INSERT INTO `t1` VALUES (16,1,1,'04:56:48','c','c');
+INSERT INTO `t1` VALUES (17,0,9,'19:56:05','m','m');
+INSERT INTO `t1` VALUES (18,9,5,'19:35:19','y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'05:03:03','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'18:38:59','d','d');
+
+set @@optimizer_switch='subquery_cache=off';
+
+/* cache is off */ SELECT (
+SELECT 4
+FROM DUAL ) AS field1 , SUM( DISTINCT table1 . `pk` ) AS field2 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_nokey` ) AS SUBQUERY2_field1
+FROM ( t1 AS SUBQUERY2_t1 INNER JOIN t1 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `pk` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` <= table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_int_nokey` < table1 . `pk` ) AS field3 , table1 . `col_time_key` AS field4 , table1 . `col_int_key` AS field5 , CONCAT ( table2 . `col_varchar_nokey` , table1 . `col_varchar_key` ) AS field6
+FROM ( t1 AS table1 INNER JOIN ( ( t1 AS table2 LEFT JOIN t2 AS table3 ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_nokey` ) )
+WHERE ( table2 . `col_varchar_nokey` NOT IN (
+SELECT 'd' UNION
+SELECT 'u' ) ) OR table3 . `col_varchar_nokey` <= table1 . `col_varchar_key`
+GROUP BY field1, field3, field4, field5, field6
+ORDER BY table1 . `col_int_key` , field1, field2, field3, field4, field5, field6
+;
+
+set @@optimizer_switch='subquery_cache=on';
+
+/* cache is on */ SELECT (
+SELECT 4
+FROM DUAL ) AS field1 , SUM( DISTINCT table1 . `pk` ) AS field2 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_nokey` ) AS SUBQUERY2_field1
+FROM ( t1 AS SUBQUERY2_t1 INNER JOIN t1 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `pk` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` <= table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_int_nokey` < table1 . `pk` ) AS field3 , table1 . `col_time_key` AS field4 , table1 . `col_int_key` AS field5 , CONCAT ( table2 . `col_varchar_nokey` , table1 . `col_varchar_key` ) AS field6
+FROM ( t1 AS table1 INNER JOIN ( ( t1 AS table2 LEFT JOIN t2 AS table3 ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_nokey` ) )
+WHERE ( table2 . `col_varchar_nokey` NOT IN (
+SELECT 'd' UNION
+SELECT 'u' ) ) OR table3 . `col_varchar_nokey` <= table1 . `col_varchar_key`
+GROUP BY field1, field3, field4, field5, field6
+ORDER BY table1 . `col_int_key` , field1, field2, field3, field4, field5, field6
+;
+
+drop table t1,t2;
+set @@optimizer_switch= default;
+
+#
+--echo #launchpad BUG#609045
+#
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+
+INSERT INTO `t1` VALUES (1,NULL,2,NULL,NULL,'11:28:45','11:28:45','2004-10-11 18:13:16','2004-10-11 18:13:16','w','w');
+INSERT INTO `t1` VALUES (2,7,9,'2001-09-19','2001-09-19','20:25:14','20:25:14',NULL,NULL,'m','m');
+INSERT INTO `t1` VALUES (3,9,3,'2004-09-12','2004-09-12','13:47:24','13:47:24','1900-01-01 00:00:00','1900-01-01 00:00:00','m','m');
+INSERT INTO `t1` VALUES (4,7,9,NULL,NULL,'19:24:11','19:24:11','2009-07-25 00:00:00','2009-07-25 00:00:00','k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'2002-07-19','2002-07-19','15:59:13','15:59:13',NULL,NULL,'r','r');
+INSERT INTO `t1` VALUES (6,2,9,'2002-12-16','2002-12-16','00:00:00','00:00:00','2008-07-27 00:00:00','2008-07-27 00:00:00','t','t');
+INSERT INTO `t1` VALUES (7,6,3,'2006-02-08','2006-02-08','15:15:04','15:15:04','2002-11-13 16:37:31','2002-11-13 16:37:31','j','j');
+INSERT INTO `t1` VALUES (8,8,8,'2006-08-28','2006-08-28','11:32:06','11:32:06','1900-01-01 00:00:00','1900-01-01 00:00:00','u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'2001-04-14','2001-04-14','18:32:33','18:32:33','2003-12-10 00:00:00','2003-12-10 00:00:00','h','h');
+INSERT INTO `t1` VALUES (10,5,53,'2000-01-05','2000-01-05','15:19:25','15:19:25','2001-12-21 22:38:22','2001-12-21 22:38:22','o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,'2003-12-06','2003-12-06','19:03:19','19:03:19','2008-12-13 23:16:44','2008-12-13 23:16:44',NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'1900-01-01','1900-01-01','00:39:46','00:39:46','2005-08-15 12:39:41','2005-08-15 12:39:41','k','k');
+INSERT INTO `t1` VALUES (13,188,166,'2002-11-27','2002-11-27',NULL,NULL,NULL,NULL,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,NULL,NULL,'00:00:00','00:00:00','2006-09-11 12:06:14','2006-09-11 12:06:14','n','n');
+INSERT INTO `t1` VALUES (15,1,0,'2003-05-27','2003-05-27','13:12:11','13:12:11','2007-12-15 12:39:34','2007-12-15 12:39:34','t','t');
+INSERT INTO `t1` VALUES (16,1,1,'2005-05-03','2005-05-03','04:56:48','04:56:48','2005-08-09 00:00:00','2005-08-09 00:00:00','c','c');
+INSERT INTO `t1` VALUES (17,0,9,'2001-04-18','2001-04-18','19:56:05','19:56:05','2001-09-02 22:50:02','2001-09-02 22:50:02','m','m');
+INSERT INTO `t1` VALUES (18,9,5,'2005-12-27','2005-12-27','19:35:19','19:35:19','2005-12-16 22:58:11','2005-12-16 22:58:11','y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'2004-08-20','2004-08-20','05:03:03','05:03:03','2007-04-19 00:19:53','2007-04-19 00:19:53','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'1900-01-01','1900-01-01','18:38:59','18:38:59','1900-01-01 00:00:00','1900-01-01 00:00:00','d','d');
+
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+);
+
+INSERT INTO `t2` VALUES (10,7,8,NULL,NULL,'01:27:35','01:27:35','2002-02-26 06:14:37','2002-02-26 06:14:37','v','v');
+INSERT INTO `t2` VALUES (11,1,9,'2006-06-14','2006-06-14','19:48:31','19:48:31','1900-01-01 00:00:00','1900-01-01 00:00:00','r','r');
+INSERT INTO `t2` VALUES (12,5,9,'2002-09-12','2002-09-12','00:00:00','00:00:00','2006-12-03 09:37:26','2006-12-03 09:37:26','a','a');
+INSERT INTO `t2` VALUES (13,3,186,'2005-02-15','2005-02-15','19:53:05','19:53:05','2008-05-26 12:27:10','2008-05-26 12:27:10','m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,NULL,NULL,'19:18:56','19:18:56','2004-12-14 16:37:30','2004-12-14 16:37:30','y','y');
+INSERT INTO `t2` VALUES (15,92,2,'2008-11-04','2008-11-04','10:55:12','10:55:12','2003-02-11 21:19:41','2003-02-11 21:19:41','j','j');
+INSERT INTO `t2` VALUES (16,7,3,'2004-09-04','2004-09-04','00:25:00','00:25:00','2009-10-18 02:27:49','2009-10-18 02:27:49','d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'2006-06-05','2006-06-05','12:35:47','12:35:47','2000-09-26 07:45:57','2000-09-26 07:45:57','z','z');
+INSERT INTO `t2` VALUES (18,3,133,'1900-01-01','1900-01-01','19:53:03','19:53:03',NULL,NULL,'e','e');
+INSERT INTO `t2` VALUES (19,5,1,'1900-01-01','1900-01-01','17:53:30','17:53:30','2005-11-10 12:40:29','2005-11-10 12:40:29','h','h');
+INSERT INTO `t2` VALUES (20,1,8,'1900-01-01','1900-01-01','11:35:49','11:35:49','2009-04-25 00:00:00','2009-04-25 00:00:00','b','b');
+INSERT INTO `t2` VALUES (21,2,5,'2005-01-13','2005-01-13',NULL,NULL,'2002-11-27 00:00:00','2002-11-27 00:00:00','s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'2006-05-21','2006-05-21','06:01:40','06:01:40','2004-01-26 20:32:32','2004-01-26 20:32:32','e','e');
+INSERT INTO `t2` VALUES (23,1,8,'2003-09-08','2003-09-08','05:45:11','05:45:11','2007-10-26 11:41:40','2007-10-26 11:41:40','j','j');
+INSERT INTO `t2` VALUES (24,0,6,'2006-12-23','2006-12-23','00:00:00','00:00:00','2005-10-07 00:00:00','2005-10-07 00:00:00','e','e');
+INSERT INTO `t2` VALUES (25,210,51,'2006-10-15','2006-10-15','00:00:00','00:00:00','2000-07-15 05:00:34','2000-07-15 05:00:34','f','f');
+INSERT INTO `t2` VALUES (26,8,4,'2005-04-06','2005-04-06','06:11:01','06:11:01','2000-04-03 16:33:32','2000-04-03 16:33:32','v','v');
+INSERT INTO `t2` VALUES (27,7,7,'2008-04-07','2008-04-07','13:02:46','13:02:46',NULL,NULL,'x','x');
+INSERT INTO `t2` VALUES (28,5,6,'2006-10-10','2006-10-10','21:44:25','21:44:25','2001-04-25 01:26:12','2001-04-25 01:26:12','m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'1900-01-01','1900-01-01','22:43:58','22:43:58','2000-12-27 00:00:00','2000-12-27 00:00:00','c','c');
+
+CREATE TABLE `t3` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+);
+
+INSERT INTO `t3` VALUES (1,1,7,'1900-01-01','1900-01-01','01:13:38','01:13:38','2005-02-05 00:00:00','2005-02-05 00:00:00','f','f');
+
+CREATE TABLE `t4` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_date_key` date DEFAULT NULL,
+ `col_date_nokey` date DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_time_nokey` time DEFAULT NULL,
+ `col_datetime_key` datetime DEFAULT NULL,
+ `col_datetime_nokey` datetime DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_date_key` (`col_date_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_datetime_key` (`col_datetime_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+);
+
+INSERT INTO `t4` VALUES (1,6,NULL,'2003-05-12','2003-05-12',NULL,NULL,'2000-09-12 00:00:00','2000-09-12 00:00:00','r','r');
+INSERT INTO `t4` VALUES (2,8,0,'2003-01-07','2003-01-07','14:34:45','14:34:45','2004-08-10 09:09:31','2004-08-10 09:09:31','c','c');
+INSERT INTO `t4` VALUES (3,6,0,NULL,NULL,'11:49:48','11:49:48','2005-03-21 04:31:40','2005-03-21 04:31:40','o','o');
+INSERT INTO `t4` VALUES (4,6,7,'2005-03-12','2005-03-12','18:12:55','18:12:55','2002-10-25 23:50:35','2002-10-25 23:50:35','c','c');
+INSERT INTO `t4` VALUES (5,3,8,'2000-08-02','2000-08-02','18:30:05','18:30:05','2001-04-01 21:14:04','2001-04-01 21:14:04','d','d');
+INSERT INTO `t4` VALUES (6,9,4,'1900-01-01','1900-01-01','14:19:30','14:19:30','2005-03-12 06:02:34','2005-03-12 06:02:34','v','v');
+INSERT INTO `t4` VALUES (7,2,6,'2006-07-06','2006-07-06','05:20:04','05:20:04','2001-05-06 14:49:12','2001-05-06 14:49:12','m','m');
+INSERT INTO `t4` VALUES (8,1,5,'2006-12-24','2006-12-24','20:29:31','20:29:31','2004-04-25 00:00:00','2004-04-25 00:00:00','j','j');
+INSERT INTO `t4` VALUES (9,8,NULL,'2004-11-16','2004-11-16','07:08:09','07:08:09','2001-03-22 18:38:43','2001-03-22 18:38:43','f','f');
+INSERT INTO `t4` VALUES (10,0,NULL,'2002-09-09','2002-09-09','14:49:14','14:49:14','2006-04-25 21:03:02','2006-04-25 21:03:02','n','n');
+INSERT INTO `t4` VALUES (11,9,8,NULL,NULL,'00:00:00','00:00:00','2009-09-07 18:40:43','2009-09-07 18:40:43','z','z');
+INSERT INTO `t4` VALUES (12,8,8,'2008-06-24','2008-06-24','09:58:06','09:58:06','2004-03-23 00:00:00','2004-03-23 00:00:00','h','h');
+INSERT INTO `t4` VALUES (13,NULL,8,'2001-04-21','2001-04-21',NULL,NULL,'2009-04-15 00:08:29','2009-04-15 00:08:29','q','q');
+INSERT INTO `t4` VALUES (14,0,1,'2003-11-22','2003-11-22','18:24:16','18:24:16','2000-04-21 00:00:00','2000-04-21 00:00:00','w','w');
+INSERT INTO `t4` VALUES (15,5,1,'2004-09-12','2004-09-12','17:39:57','17:39:57','2000-02-17 19:41:23','2000-02-17 19:41:23','z','z');
+INSERT INTO `t4` VALUES (16,1,5,'2006-06-20','2006-06-20','08:23:21','08:23:21','2003-09-20 07:38:14','2003-09-20 07:38:14','j','j');
+INSERT INTO `t4` VALUES (17,1,2,NULL,NULL,NULL,NULL,'2000-11-28 20:42:12','2000-11-28 20:42:12','a','a');
+INSERT INTO `t4` VALUES (18,6,7,'2001-11-25','2001-11-25','21:50:46','21:50:46','2005-06-12 11:13:17','2005-06-12 11:13:17','m','m');
+INSERT INTO `t4` VALUES (19,6,6,'2004-10-26','2004-10-26','12:33:17','12:33:17','1900-01-01 00:00:00','1900-01-01 00:00:00','n','n');
+INSERT INTO `t4` VALUES (20,1,4,'2005-01-19','2005-01-19','03:06:43','03:06:43','2006-02-09 20:41:06','2006-02-09 20:41:06','e','e');
+INSERT INTO `t4` VALUES (21,8,7,'2008-07-06','2008-07-06','03:46:14','03:46:14','2004-05-22 01:05:57','2004-05-22 01:05:57','u','u');
+INSERT INTO `t4` VALUES (22,1,0,'1900-01-01','1900-01-01','20:34:52','20:34:52','2004-03-04 13:46:31','2004-03-04 13:46:31','s','s');
+INSERT INTO `t4` VALUES (23,0,9,'1900-01-01','1900-01-01',NULL,NULL,'1900-01-01 00:00:00','1900-01-01 00:00:00','u','u');
+INSERT INTO `t4` VALUES (24,4,3,'2004-06-08','2004-06-08','10:41:20','10:41:20','2004-10-20 07:20:19','2004-10-20 07:20:19','r','r');
+INSERT INTO `t4` VALUES (25,9,5,'2007-02-20','2007-02-20','08:43:11','08:43:11','2006-04-17 00:00:00','2006-04-17 00:00:00','g','g');
+INSERT INTO `t4` VALUES (26,8,1,'2008-06-18','2008-06-18',NULL,NULL,'2000-10-27 00:00:00','2000-10-27 00:00:00','o','o');
+INSERT INTO `t4` VALUES (27,5,1,'2008-05-15','2008-05-15','10:17:51','10:17:51','2007-04-14 08:54:06','2007-04-14 08:54:06','w','w');
+INSERT INTO `t4` VALUES (28,9,5,'2005-10-06','2005-10-06','06:34:09','06:34:09','2008-04-12 17:03:52','2008-04-12 17:03:52','b','b');
+INSERT INTO `t4` VALUES (29,5,9,NULL,NULL,'21:22:47','21:22:47','2007-02-19 17:37:09','2007-02-19 17:37:09',NULL,NULL);
+INSERT INTO `t4` VALUES (30,NULL,2,'2006-10-12','2006-10-12','04:02:32','04:02:32','1900-01-01 00:00:00','1900-01-01 00:00:00','y','y');
+INSERT INTO `t4` VALUES (31,NULL,5,'2005-01-24','2005-01-24','02:33:14','02:33:14','2001-10-10 08:32:27','2001-10-10 08:32:27','y','y');
+INSERT INTO `t4` VALUES (32,105,248,'2009-06-27','2009-06-27','16:32:56','16:32:56',NULL,NULL,'u','u');
+INSERT INTO `t4` VALUES (33,0,0,NULL,NULL,'21:32:42','21:32:42','2001-12-16 05:31:53','2001-12-16 05:31:53','p','p');
+INSERT INTO `t4` VALUES (34,3,8,NULL,NULL,'23:04:47','23:04:47','2003-07-19 18:03:28','2003-07-19 18:03:28','s','s');
+INSERT INTO `t4` VALUES (35,1,1,'1900-01-01','1900-01-01','22:05:43','22:05:43','2001-03-27 11:44:10','2001-03-27 11:44:10','e','e');
+INSERT INTO `t4` VALUES (36,75,255,'2005-12-22','2005-12-22','02:05:45','02:05:45','2008-06-15 02:13:00','2008-06-15 02:13:00','d','d');
+INSERT INTO `t4` VALUES (37,9,9,'2005-05-03','2005-05-03','00:00:00','00:00:00','2009-03-14 21:29:56','2009-03-14 21:29:56','d','d');
+INSERT INTO `t4` VALUES (38,7,9,'2003-05-27','2003-05-27','18:09:07','18:09:07','2005-01-02 00:00:00','2005-01-02 00:00:00','c','c');
+INSERT INTO `t4` VALUES (39,NULL,3,'2006-05-25','2006-05-25','10:54:06','10:54:06','2007-07-16 04:44:07','2007-07-16 04:44:07','b','b');
+INSERT INTO `t4` VALUES (40,NULL,9,NULL,NULL,'23:15:50','23:15:50','2003-08-26 21:38:26','2003-08-26 21:38:26','t','t');
+INSERT INTO `t4` VALUES (41,4,6,'2009-01-04','2009-01-04','10:17:40','10:17:40','2004-04-19 04:18:47','2004-04-19 04:18:47',NULL,NULL);
+INSERT INTO `t4` VALUES (42,0,4,'2009-02-14','2009-02-14','03:37:09','03:37:09','2000-01-06 20:32:48','2000-01-06 20:32:48','y','y');
+INSERT INTO `t4` VALUES (43,204,60,'2003-01-16','2003-01-16','22:26:06','22:26:06','2006-06-23 13:27:17','2006-06-23 13:27:17','c','c');
+INSERT INTO `t4` VALUES (44,0,7,'1900-01-01','1900-01-01','17:10:38','17:10:38','2007-11-27 00:00:00','2007-11-27 00:00:00','d','d');
+INSERT INTO `t4` VALUES (45,9,1,'2007-06-26','2007-06-26','00:00:00','00:00:00','2002-04-03 12:06:51','2002-04-03 12:06:51','x','x');
+INSERT INTO `t4` VALUES (46,8,6,'2004-03-27','2004-03-27','17:08:49','17:08:49','2008-12-28 09:47:42','2008-12-28 09:47:42','p','p');
+INSERT INTO `t4` VALUES (47,7,4,NULL,NULL,'19:04:40','19:04:40','2002-04-04 10:07:54','2002-04-04 10:07:54','e','e');
+INSERT INTO `t4` VALUES (48,8,NULL,'2005-06-06','2005-06-06','20:53:28','20:53:28','2003-04-26 02:55:13','2003-04-26 02:55:13','g','g');
+INSERT INTO `t4` VALUES (49,NULL,8,'2003-03-02','2003-03-02','11:46:03','11:46:03',NULL,NULL,'x','x');
+INSERT INTO `t4` VALUES (50,6,0,'2004-05-13','2004-05-13',NULL,NULL,'2009-02-19 03:17:06','2009-02-19 03:17:06','s','s');
+INSERT INTO `t4` VALUES (51,5,8,'2005-09-13','2005-09-13','10:58:07','10:58:07','1900-01-01 00:00:00','1900-01-01 00:00:00','e','e');
+INSERT INTO `t4` VALUES (52,2,151,'2005-10-03','2005-10-03','00:00:00','00:00:00','2000-11-10 08:20:01','2000-11-10 08:20:01','l','l');
+INSERT INTO `t4` VALUES (53,3,7,'2005-10-14','2005-10-14','09:43:15','09:43:15','2008-02-10 00:00:00','2008-02-10 00:00:00','p','p');
+INSERT INTO `t4` VALUES (54,7,6,NULL,NULL,'21:40:32','21:40:32','1900-01-01 00:00:00','1900-01-01 00:00:00','h','h');
+INSERT INTO `t4` VALUES (55,NULL,NULL,'2005-09-16','2005-09-16','00:17:44','00:17:44',NULL,NULL,'m','m');
+INSERT INTO `t4` VALUES (56,145,23,'2005-03-10','2005-03-10','16:47:26','16:47:26','2001-02-05 02:01:50','2001-02-05 02:01:50','n','n');
+INSERT INTO `t4` VALUES (57,0,2,'2000-06-19','2000-06-19','00:00:00','00:00:00','2000-10-28 08:44:25','2000-10-28 08:44:25','v','v');
+INSERT INTO `t4` VALUES (58,1,4,'2002-11-03','2002-11-03','05:25:59','05:25:59','2005-03-20 10:53:59','2005-03-20 10:53:59','b','b');
+INSERT INTO `t4` VALUES (59,7,NULL,'2009-01-05','2009-01-05','00:00:00','00:00:00','2001-06-02 13:54:13','2001-06-02 13:54:13','x','x');
+INSERT INTO `t4` VALUES (60,3,NULL,'2003-05-22','2003-05-22','20:33:04','20:33:04','1900-01-01 00:00:00','1900-01-01 00:00:00','r','r');
+INSERT INTO `t4` VALUES (61,NULL,77,'2005-07-02','2005-07-02','00:46:12','00:46:12','2009-07-16 13:05:43','2009-07-16 13:05:43','t','t');
+INSERT INTO `t4` VALUES (62,2,NULL,'1900-01-01','1900-01-01','00:00:00','00:00:00','2009-03-26 23:16:20','2009-03-26 23:16:20','w','w');
+INSERT INTO `t4` VALUES (63,2,NULL,'2006-06-21','2006-06-21','02:13:59','02:13:59','2003-02-06 18:12:15','2003-02-06 18:12:15','w','w');
+INSERT INTO `t4` VALUES (64,2,7,NULL,NULL,'02:54:47','02:54:47','2006-06-05 03:22:51','2006-06-05 03:22:51','k','k');
+INSERT INTO `t4` VALUES (65,8,1,'2005-12-16','2005-12-16','18:13:59','18:13:59','2002-02-10 05:47:27','2002-02-10 05:47:27','a','a');
+INSERT INTO `t4` VALUES (66,6,9,'2004-11-05','2004-11-05','13:53:08','13:53:08','2001-08-01 08:50:52','2001-08-01 08:50:52','t','t');
+INSERT INTO `t4` VALUES (67,1,6,NULL,NULL,'22:21:30','22:21:30','1900-01-01 00:00:00','1900-01-01 00:00:00','z','z');
+INSERT INTO `t4` VALUES (68,NULL,2,'2004-09-14','2004-09-14','11:41:50','11:41:50',NULL,NULL,'e','e');
+INSERT INTO `t4` VALUES (69,1,3,'2002-04-06','2002-04-06','15:20:02','15:20:02','1900-01-01 00:00:00','1900-01-01 00:00:00','q','q');
+INSERT INTO `t4` VALUES (70,0,0,NULL,NULL,NULL,NULL,'2000-09-23 00:00:00','2000-09-23 00:00:00','e','e');
+INSERT INTO `t4` VALUES (71,4,NULL,'2002-11-13','2002-11-13',NULL,NULL,'2007-07-09 08:32:49','2007-07-09 08:32:49','v','v');
+INSERT INTO `t4` VALUES (72,1,6,'2006-05-27','2006-05-27','07:51:52','07:51:52','2000-01-05 00:00:00','2000-01-05 00:00:00','d','d');
+INSERT INTO `t4` VALUES (73,1,3,'2000-12-22','2000-12-22','00:00:00','00:00:00','2000-09-24 00:00:00','2000-09-24 00:00:00','u','u');
+INSERT INTO `t4` VALUES (74,27,195,'2004-02-21','2004-02-21',NULL,NULL,'2005-05-06 00:00:00','2005-05-06 00:00:00','o','o');
+INSERT INTO `t4` VALUES (75,4,5,'2009-05-15','2009-05-15',NULL,NULL,'2000-03-11 00:00:00','2000-03-11 00:00:00','b','b');
+INSERT INTO `t4` VALUES (76,6,2,'2008-12-12','2008-12-12','12:31:05','12:31:05','2001-09-02 16:17:35','2001-09-02 16:17:35','c','c');
+INSERT INTO `t4` VALUES (77,2,7,'2000-04-15','2000-04-15','00:00:00','00:00:00','2006-04-25 05:43:44','2006-04-25 05:43:44','q','q');
+INSERT INTO `t4` VALUES (78,248,25,NULL,NULL,'01:16:45','01:16:45','2009-10-25 22:04:02','2009-10-25 22:04:02',NULL,NULL);
+INSERT INTO `t4` VALUES (79,NULL,NULL,'2001-10-18','2001-10-18','20:38:54','20:38:54','2004-08-06 00:00:00','2004-08-06 00:00:00','h','h');
+INSERT INTO `t4` VALUES (80,9,0,'2008-05-25','2008-05-25','00:30:15','00:30:15','2001-11-27 05:07:57','2001-11-27 05:07:57','d','d');
+INSERT INTO `t4` VALUES (81,75,98,'2004-12-02','2004-12-02','23:46:36','23:46:36','2009-06-28 03:18:39','2009-06-28 03:18:39','w','w');
+INSERT INTO `t4` VALUES (82,2,6,'2002-02-15','2002-02-15','19:03:13','19:03:13','2000-03-12 00:00:00','2000-03-12 00:00:00','m','m');
+INSERT INTO `t4` VALUES (83,9,5,'2002-03-03','2002-03-03','10:54:27','10:54:27',NULL,NULL,'i','i');
+INSERT INTO `t4` VALUES (84,4,0,NULL,NULL,'00:25:47','00:25:47','2007-10-20 00:00:00','2007-10-20 00:00:00','w','w');
+INSERT INTO `t4` VALUES (85,0,3,'2003-01-26','2003-01-26','08:44:27','08:44:27','2009-09-27 00:00:00','2009-09-27 00:00:00','f','f');
+INSERT INTO `t4` VALUES (86,0,1,'2001-12-19','2001-12-19','08:15:38','08:15:38','2002-07-16 00:00:00','2002-07-16 00:00:00','k','k');
+INSERT INTO `t4` VALUES (87,1,1,'2001-08-07','2001-08-07','19:56:21','19:56:21','2005-02-20 00:00:00','2005-02-20 00:00:00','v','v');
+INSERT INTO `t4` VALUES (88,119,147,'2005-02-16','2005-02-16','00:00:00','00:00:00',NULL,NULL,'c','c');
+INSERT INTO `t4` VALUES (89,1,3,'2006-06-10','2006-06-10','20:50:52','20:50:52','2001-07-16 00:00:00','2001-07-16 00:00:00','y','y');
+INSERT INTO `t4` VALUES (90,7,3,NULL,NULL,'03:54:39','03:54:39','2009-05-20 21:04:12','2009-05-20 21:04:12','h','h');
+INSERT INTO `t4` VALUES (91,2,NULL,'2005-04-06','2005-04-06','23:58:17','23:58:17','2002-03-13 10:55:40','2002-03-13 10:55:40',NULL,NULL);
+INSERT INTO `t4` VALUES (92,7,2,'2003-04-27','2003-04-27','12:54:58','12:54:58','2005-07-12 00:00:00','2005-07-12 00:00:00','t','t');
+INSERT INTO `t4` VALUES (93,2,1,'2005-10-13','2005-10-13','04:02:43','04:02:43','2006-07-22 09:46:34','2006-07-22 09:46:34','l','l');
+INSERT INTO `t4` VALUES (94,6,8,'2003-10-02','2003-10-02','11:31:12','11:31:12','2001-09-01 00:00:00','2001-09-01 00:00:00','a','a');
+INSERT INTO `t4` VALUES (95,4,8,'2005-09-09','2005-09-09','20:20:04','20:20:04','2002-05-27 18:38:45','2002-05-27 18:38:45','r','r');
+INSERT INTO `t4` VALUES (96,5,8,NULL,NULL,'00:22:24','00:22:24',NULL,NULL,'s','s');
+INSERT INTO `t4` VALUES (97,7,0,'2006-02-15','2006-02-15','10:09:31','10:09:31',NULL,NULL,'z','z');
+INSERT INTO `t4` VALUES (98,1,1,'1900-01-01','1900-01-01',NULL,NULL,'2009-08-08 22:38:53','2009-08-08 22:38:53','j','j');
+INSERT INTO `t4` VALUES (99,7,8,'2003-12-24','2003-12-24','18:45:35','18:45:35',NULL,NULL,'c','c');
+INSERT INTO `t4` VALUES (100,2,5,'2001-07-26','2001-07-26','11:49:25','11:49:25','2007-04-25 05:08:49','2007-04-25 05:08:49','f','f');
+
+SET @@optimizer_switch='subquery_cache=off';
+
+/* cache is off */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+
+SET @@optimizer_switch='subquery_cache=on';
+
+/* cache is on */ SELECT COUNT( DISTINCT table2 .`col_int_key` ) , (
+SELECT SUBQUERY2_t1 .`col_int_key`
+FROM t3 SUBQUERY2_t1 JOIN t2 ON SUBQUERY2_t1 .`col_int_key`
+WHERE table1 .`col_varchar_key` ) , table2 .`col_varchar_nokey` field10
+FROM t4 table1 JOIN ( t1 table2 STRAIGHT_JOIN t1 table3 ON table2 .`pk` ) ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+GROUP BY field10 ;
+
+drop table t1,t2,t3,t4;
+set @@optimizer_switch= default;
+
+#
+--echo #launchpad BUG#609045
+#
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,7,8,'v','v');
+INSERT INTO `t2` VALUES (11,1,9,'r','r');
+INSERT INTO `t2` VALUES (12,5,9,'a','a');
+INSERT INTO `t2` VALUES (13,3,186,'m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,'y','y');
+INSERT INTO `t2` VALUES (15,92,2,'j','j');
+INSERT INTO `t2` VALUES (16,7,3,'d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'z','z');
+INSERT INTO `t2` VALUES (18,3,133,'e','e');
+INSERT INTO `t2` VALUES (19,5,1,'h','h');
+INSERT INTO `t2` VALUES (20,1,8,'b','b');
+INSERT INTO `t2` VALUES (21,2,5,'s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'e','e');
+INSERT INTO `t2` VALUES (23,1,8,'j','j');
+INSERT INTO `t2` VALUES (24,0,6,'e','e');
+INSERT INTO `t2` VALUES (25,210,51,'f','f');
+INSERT INTO `t2` VALUES (26,8,4,'v','v');
+INSERT INTO `t2` VALUES (27,7,7,'x','x');
+INSERT INTO `t2` VALUES (28,5,6,'m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'c','c');
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,'w','w');
+INSERT INTO `t1` VALUES (2,7,9,'m','m');
+INSERT INTO `t1` VALUES (3,9,3,'m','m');
+INSERT INTO `t1` VALUES (4,7,9,'k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'r','r');
+INSERT INTO `t1` VALUES (6,2,9,'t','t');
+INSERT INTO `t1` VALUES (7,6,3,'j','j');
+INSERT INTO `t1` VALUES (8,8,8,'u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'h','h');
+INSERT INTO `t1` VALUES (10,5,53,'o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'k','k');
+INSERT INTO `t1` VALUES (13,188,166,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,'n','n');
+INSERT INTO `t1` VALUES (15,1,0,'t','t');
+INSERT INTO `t1` VALUES (16,1,1,'c','c');
+INSERT INTO `t1` VALUES (17,0,9,'m','m');
+INSERT INTO `t1` VALUES (18,9,5,'y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'f','f');
+INSERT INTO `t1` VALUES (20,4,2,'d','d');
+
+SET @@optimizer_switch = 'subquery_cache=off';
+
+/* cache is off */ SELECT SUM( DISTINCT table1 .`pk` ) , (
+ SELECT MAX( `col_int_nokey` )
+ FROM t1
+ WHERE table1 .`pk` ) field3
+FROM t1 table1
+JOIN (
+ t1 table2
+ JOIN t2 table3
+ ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+)
+ON table3 .`col_varchar_key` = table2 .`col_varchar_nokey`
+GROUP BY field3 ;
+
+SET @@optimizer_switch = 'subquery_cache=on';
+
+/* cache is on */ SELECT SUM( DISTINCT table1 .`pk` ) , (
+ SELECT MAX( `col_int_nokey` )
+ FROM t1
+ WHERE table1 .`pk` ) field3
+FROM t1 table1
+JOIN (
+ t1 table2
+ JOIN t2 table3
+ ON table3 .`col_varchar_key` = table2 .`col_varchar_key`
+)
+ON table3 .`col_varchar_key` = table2 .`col_varchar_nokey`
+GROUP BY field3 ;
+
+drop table t1,t2;
+set @@optimizer_switch= default;
+
+#
+--echo #launchpad BUG#609052
+#
+CREATE TABLE `t2` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=30 DEFAULT CHARSET=latin1;
+INSERT INTO `t2` VALUES (10,7,8,'01:27:35','v','v');
+INSERT INTO `t2` VALUES (11,1,9,'19:48:31','r','r');
+INSERT INTO `t2` VALUES (12,5,9,'00:00:00','a','a');
+INSERT INTO `t2` VALUES (13,3,186,'19:53:05','m','m');
+INSERT INTO `t2` VALUES (14,6,NULL,'19:18:56','y','y');
+INSERT INTO `t2` VALUES (15,92,2,'10:55:12','j','j');
+INSERT INTO `t2` VALUES (16,7,3,'00:25:00','d','d');
+INSERT INTO `t2` VALUES (17,NULL,0,'12:35:47','z','z');
+INSERT INTO `t2` VALUES (18,3,133,'19:53:03','e','e');
+INSERT INTO `t2` VALUES (19,5,1,'17:53:30','h','h');
+INSERT INTO `t2` VALUES (20,1,8,'11:35:49','b','b');
+INSERT INTO `t2` VALUES (21,2,5,NULL,'s','s');
+INSERT INTO `t2` VALUES (22,NULL,5,'06:01:40','e','e');
+INSERT INTO `t2` VALUES (23,1,8,'05:45:11','j','j');
+INSERT INTO `t2` VALUES (24,0,6,'00:00:00','e','e');
+INSERT INTO `t2` VALUES (25,210,51,'00:00:00','f','f');
+INSERT INTO `t2` VALUES (26,8,4,'06:11:01','v','v');
+INSERT INTO `t2` VALUES (27,7,7,'13:02:46','x','x');
+INSERT INTO `t2` VALUES (28,5,6,'21:44:25','m','m');
+INSERT INTO `t2` VALUES (29,NULL,4,'22:43:58','c','c');
+CREATE TABLE `t4` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=101 DEFAULT CHARSET=latin1;
+INSERT INTO `t4` VALUES (1,6,NULL,NULL,'r','r');
+INSERT INTO `t4` VALUES (2,8,0,'14:34:45','c','c');
+INSERT INTO `t4` VALUES (3,6,0,'11:49:48','o','o');
+INSERT INTO `t4` VALUES (4,6,7,'18:12:55','c','c');
+INSERT INTO `t4` VALUES (5,3,8,'18:30:05','d','d');
+INSERT INTO `t4` VALUES (6,9,4,'14:19:30','v','v');
+INSERT INTO `t4` VALUES (7,2,6,'05:20:04','m','m');
+INSERT INTO `t4` VALUES (8,1,5,'20:29:31','j','j');
+INSERT INTO `t4` VALUES (9,8,NULL,'07:08:09','f','f');
+INSERT INTO `t4` VALUES (10,0,NULL,'14:49:14','n','n');
+INSERT INTO `t4` VALUES (11,9,8,'00:00:00','z','z');
+INSERT INTO `t4` VALUES (12,8,8,'09:58:06','h','h');
+INSERT INTO `t4` VALUES (13,NULL,8,NULL,'q','q');
+INSERT INTO `t4` VALUES (14,0,1,'18:24:16','w','w');
+INSERT INTO `t4` VALUES (15,5,1,'17:39:57','z','z');
+INSERT INTO `t4` VALUES (16,1,5,'08:23:21','j','j');
+INSERT INTO `t4` VALUES (17,1,2,NULL,'a','a');
+INSERT INTO `t4` VALUES (18,6,7,'21:50:46','m','m');
+INSERT INTO `t4` VALUES (19,6,6,'12:33:17','n','n');
+INSERT INTO `t4` VALUES (20,1,4,'03:06:43','e','e');
+INSERT INTO `t4` VALUES (21,8,7,'03:46:14','u','u');
+INSERT INTO `t4` VALUES (22,1,0,'20:34:52','s','s');
+INSERT INTO `t4` VALUES (23,0,9,NULL,'u','u');
+INSERT INTO `t4` VALUES (24,4,3,'10:41:20','r','r');
+INSERT INTO `t4` VALUES (25,9,5,'08:43:11','g','g');
+INSERT INTO `t4` VALUES (26,8,1,NULL,'o','o');
+INSERT INTO `t4` VALUES (27,5,1,'10:17:51','w','w');
+INSERT INTO `t4` VALUES (28,9,5,'06:34:09','b','b');
+INSERT INTO `t4` VALUES (29,5,9,'21:22:47',NULL,NULL);
+INSERT INTO `t4` VALUES (30,NULL,2,'04:02:32','y','y');
+INSERT INTO `t4` VALUES (31,NULL,5,'02:33:14','y','y');
+INSERT INTO `t4` VALUES (32,105,248,'16:32:56','u','u');
+INSERT INTO `t4` VALUES (33,0,0,'21:32:42','p','p');
+INSERT INTO `t4` VALUES (34,3,8,'23:04:47','s','s');
+INSERT INTO `t4` VALUES (35,1,1,'22:05:43','e','e');
+INSERT INTO `t4` VALUES (36,75,255,'02:05:45','d','d');
+INSERT INTO `t4` VALUES (37,9,9,'00:00:00','d','d');
+INSERT INTO `t4` VALUES (38,7,9,'18:09:07','c','c');
+INSERT INTO `t4` VALUES (39,NULL,3,'10:54:06','b','b');
+INSERT INTO `t4` VALUES (40,NULL,9,'23:15:50','t','t');
+INSERT INTO `t4` VALUES (41,4,6,'10:17:40',NULL,NULL);
+INSERT INTO `t4` VALUES (42,0,4,'03:37:09','y','y');
+INSERT INTO `t4` VALUES (43,204,60,'22:26:06','c','c');
+INSERT INTO `t4` VALUES (44,0,7,'17:10:38','d','d');
+INSERT INTO `t4` VALUES (45,9,1,'00:00:00','x','x');
+INSERT INTO `t4` VALUES (46,8,6,'17:08:49','p','p');
+INSERT INTO `t4` VALUES (47,7,4,'19:04:40','e','e');
+INSERT INTO `t4` VALUES (48,8,NULL,'20:53:28','g','g');
+INSERT INTO `t4` VALUES (49,NULL,8,'11:46:03','x','x');
+INSERT INTO `t4` VALUES (50,6,0,NULL,'s','s');
+INSERT INTO `t4` VALUES (51,5,8,'10:58:07','e','e');
+INSERT INTO `t4` VALUES (52,2,151,'00:00:00','l','l');
+INSERT INTO `t4` VALUES (53,3,7,'09:43:15','p','p');
+INSERT INTO `t4` VALUES (54,7,6,'21:40:32','h','h');
+INSERT INTO `t4` VALUES (55,NULL,NULL,'00:17:44','m','m');
+INSERT INTO `t4` VALUES (56,145,23,'16:47:26','n','n');
+INSERT INTO `t4` VALUES (57,0,2,'00:00:00','v','v');
+INSERT INTO `t4` VALUES (58,1,4,'05:25:59','b','b');
+INSERT INTO `t4` VALUES (59,7,NULL,'00:00:00','x','x');
+INSERT INTO `t4` VALUES (60,3,NULL,'20:33:04','r','r');
+INSERT INTO `t4` VALUES (61,NULL,77,'00:46:12','t','t');
+INSERT INTO `t4` VALUES (62,2,NULL,'00:00:00','w','w');
+INSERT INTO `t4` VALUES (63,2,NULL,'02:13:59','w','w');
+INSERT INTO `t4` VALUES (64,2,7,'02:54:47','k','k');
+INSERT INTO `t4` VALUES (65,8,1,'18:13:59','a','a');
+INSERT INTO `t4` VALUES (66,6,9,'13:53:08','t','t');
+INSERT INTO `t4` VALUES (67,1,6,'22:21:30','z','z');
+INSERT INTO `t4` VALUES (68,NULL,2,'11:41:50','e','e');
+INSERT INTO `t4` VALUES (69,1,3,'15:20:02','q','q');
+INSERT INTO `t4` VALUES (70,0,0,NULL,'e','e');
+INSERT INTO `t4` VALUES (71,4,NULL,NULL,'v','v');
+INSERT INTO `t4` VALUES (72,1,6,'07:51:52','d','d');
+INSERT INTO `t4` VALUES (73,1,3,'00:00:00','u','u');
+INSERT INTO `t4` VALUES (74,27,195,NULL,'o','o');
+INSERT INTO `t4` VALUES (75,4,5,NULL,'b','b');
+INSERT INTO `t4` VALUES (76,6,2,'12:31:05','c','c');
+INSERT INTO `t4` VALUES (77,2,7,'00:00:00','q','q');
+INSERT INTO `t4` VALUES (78,248,25,'01:16:45',NULL,NULL);
+INSERT INTO `t4` VALUES (79,NULL,NULL,'20:38:54','h','h');
+INSERT INTO `t4` VALUES (80,9,0,'00:30:15','d','d');
+INSERT INTO `t4` VALUES (81,75,98,'23:46:36','w','w');
+INSERT INTO `t4` VALUES (82,2,6,'19:03:13','m','m');
+INSERT INTO `t4` VALUES (83,9,5,'10:54:27','i','i');
+INSERT INTO `t4` VALUES (84,4,0,'00:25:47','w','w');
+INSERT INTO `t4` VALUES (85,0,3,'08:44:27','f','f');
+INSERT INTO `t4` VALUES (86,0,1,'08:15:38','k','k');
+INSERT INTO `t4` VALUES (87,1,1,'19:56:21','v','v');
+INSERT INTO `t4` VALUES (88,119,147,'00:00:00','c','c');
+INSERT INTO `t4` VALUES (89,1,3,'20:50:52','y','y');
+INSERT INTO `t4` VALUES (90,7,3,'03:54:39','h','h');
+INSERT INTO `t4` VALUES (91,2,NULL,'23:58:17',NULL,NULL);
+INSERT INTO `t4` VALUES (92,7,2,'12:54:58','t','t');
+INSERT INTO `t4` VALUES (93,2,1,'04:02:43','l','l');
+INSERT INTO `t4` VALUES (94,6,8,'11:31:12','a','a');
+INSERT INTO `t4` VALUES (95,4,8,'20:20:04','r','r');
+INSERT INTO `t4` VALUES (96,5,8,'00:22:24','s','s');
+INSERT INTO `t4` VALUES (97,7,0,'10:09:31','z','z');
+INSERT INTO `t4` VALUES (98,1,1,NULL,'j','j');
+INSERT INTO `t4` VALUES (99,7,8,'18:45:35','c','c');
+INSERT INTO `t4` VALUES (100,2,5,'11:49:25','f','f');
+CREATE TABLE `t1` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=21 DEFAULT CHARSET=latin1;
+INSERT INTO `t1` VALUES (1,NULL,2,'11:28:45','w','w');
+INSERT INTO `t1` VALUES (2,7,9,'20:25:14','m','m');
+INSERT INTO `t1` VALUES (3,9,3,'13:47:24','m','m');
+INSERT INTO `t1` VALUES (4,7,9,'19:24:11','k','k');
+INSERT INTO `t1` VALUES (5,4,NULL,'15:59:13','r','r');
+INSERT INTO `t1` VALUES (6,2,9,'00:00:00','t','t');
+INSERT INTO `t1` VALUES (7,6,3,'15:15:04','j','j');
+INSERT INTO `t1` VALUES (8,8,8,'11:32:06','u','u');
+INSERT INTO `t1` VALUES (9,NULL,8,'18:32:33','h','h');
+INSERT INTO `t1` VALUES (10,5,53,'15:19:25','o','o');
+INSERT INTO `t1` VALUES (11,NULL,0,'19:03:19',NULL,NULL);
+INSERT INTO `t1` VALUES (12,6,5,'00:39:46','k','k');
+INSERT INTO `t1` VALUES (13,188,166,NULL,'e','e');
+INSERT INTO `t1` VALUES (14,2,3,'00:00:00','n','n');
+INSERT INTO `t1` VALUES (15,1,0,'13:12:11','t','t');
+INSERT INTO `t1` VALUES (16,1,1,'04:56:48','c','c');
+INSERT INTO `t1` VALUES (17,0,9,'19:56:05','m','m');
+INSERT INTO `t1` VALUES (18,9,5,'19:35:19','y','y');
+INSERT INTO `t1` VALUES (19,NULL,6,'05:03:03','f','f');
+INSERT INTO `t1` VALUES (20,4,2,'18:38:59','d','d');
+CREATE TABLE `t3` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
+INSERT INTO `t3` VALUES (10,8,8,'18:27:58',NULL,NULL);
+CREATE TABLE `t5` (
+ `pk` int(11) NOT NULL AUTO_INCREMENT,
+ `col_int_nokey` int(11) DEFAULT NULL,
+ `col_int_key` int(11) DEFAULT NULL,
+ `col_time_key` time DEFAULT NULL,
+ `col_varchar_key` varchar(1) DEFAULT NULL,
+ `col_varchar_nokey` varchar(1) DEFAULT NULL,
+ PRIMARY KEY (`pk`),
+ KEY `col_int_key` (`col_int_key`),
+ KEY `col_time_key` (`col_time_key`),
+ KEY `col_varchar_key` (`col_varchar_key`,`col_int_key`)
+) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
+INSERT INTO `t5` VALUES (1,1,7,'01:13:38','f','f');
+
+
+SET @@optimizer_switch='subquery_cache=off';
+
+/* cache is off */ SELECT SQL_SMALL_RESULT MAX( DISTINCT table1 . `col_varchar_key` ) AS field1 , MIN( table1 . `col_varchar_nokey` ) AS field2 , COUNT( table1 . `col_varchar_key` ) AS field3 , table2 . `col_time_key` AS field4 , COUNT( DISTINCT table2 . `col_int_key` ) AS field5 , (
+SELECT MAX( SUBQUERY1_t2 . `col_int_nokey` ) AS SUBQUERY1_field1
+FROM ( t3 AS SUBQUERY1_t1 INNER JOIN t1 AS SUBQUERY1_t2 ON (SUBQUERY1_t2 . `col_varchar_key` = SUBQUERY1_t1 . `col_varchar_nokey` ) )
+WHERE SUBQUERY1_t2 . `pk` < SUBQUERY1_t2 . `pk` ) AS field6 , COUNT( table1 . `col_varchar_nokey` ) AS field7 , COUNT( table2 . `pk` ) AS field8 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_key` ) AS SUBQUERY2_field1
+FROM ( t5 AS SUBQUERY2_t1 LEFT JOIN t2 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `col_int_key` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` != table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_varchar_nokey` >= 'o' ) AS field9 , CONCAT ( table1 . `col_varchar_key` , table2 . `col_varchar_nokey` ) AS field10
+FROM ( t4 AS table1 LEFT JOIN ( ( t1 AS table2 STRAIGHT_JOIN t1 AS table3 ON (table3 . `col_int_nokey` = table2 . `pk` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) )
+WHERE ( EXISTS (
+SELECT SUBQUERY3_t1 . `pk` AS SUBQUERY3_field1
+FROM ( t4 AS SUBQUERY3_t1 INNER JOIN t4 AS SUBQUERY3_t2 ON (SUBQUERY3_t2 . `col_varchar_key` = SUBQUERY3_t1 . `col_varchar_key` ) )
+WHERE SUBQUERY3_t1 . `col_int_key` > table3 . `pk` AND SUBQUERY3_t1 . `pk` != table3 . `pk` ) ) AND ( table1 . `pk` > 116 AND table1 . `pk` < ( 116 + 175 ) OR table1 . `pk` IN (251) ) OR table1 . `col_int_nokey` = table1 . `col_int_nokey`
+GROUP BY field4, field6, field9, field10
+HAVING field10 = 'c'
+;
+
+SET @@optimizer_switch='subquery_cache=on';
+
+/* cache is on */ SELECT SQL_SMALL_RESULT MAX( DISTINCT table1 . `col_varchar_key` ) AS field1 , MIN( table1 . `col_varchar_nokey` ) AS field2 , COUNT( table1 . `col_varchar_key` ) AS field3 , table2 . `col_time_key` AS field4 , COUNT( DISTINCT table2 . `col_int_key` ) AS field5 , (
+SELECT MAX( SUBQUERY1_t2 . `col_int_nokey` ) AS SUBQUERY1_field1
+FROM ( t3 AS SUBQUERY1_t1 INNER JOIN t1 AS SUBQUERY1_t2 ON (SUBQUERY1_t2 . `col_varchar_key` = SUBQUERY1_t1 . `col_varchar_nokey` ) )
+WHERE SUBQUERY1_t2 . `pk` < SUBQUERY1_t2 . `pk` ) AS field6 , COUNT( table1 . `col_varchar_nokey` ) AS field7 , COUNT( table2 . `pk` ) AS field8 , (
+SELECT MAX( SUBQUERY2_t1 . `col_int_key` ) AS SUBQUERY2_field1
+FROM ( t5 AS SUBQUERY2_t1 LEFT JOIN t2 AS SUBQUERY2_t2 ON (SUBQUERY2_t2 . `col_int_key` = SUBQUERY2_t1 . `col_int_key` ) )
+WHERE SUBQUERY2_t2 . `col_varchar_nokey` != table1 . `col_varchar_key` OR SUBQUERY2_t1 . `col_varchar_nokey` >= 'o' ) AS field9 , CONCAT ( table1 . `col_varchar_key` , table2 . `col_varchar_nokey` ) AS field10
+FROM ( t4 AS table1 LEFT JOIN ( ( t1 AS table2 STRAIGHT_JOIN t1 AS table3 ON (table3 . `col_int_nokey` = table2 . `pk` ) ) ) ON (table3 . `col_varchar_key` = table2 . `col_varchar_key` ) )
+WHERE ( EXISTS (
+SELECT SUBQUERY3_t1 . `pk` AS SUBQUERY3_field1
+FROM ( t4 AS SUBQUERY3_t1 INNER JOIN t4 AS SUBQUERY3_t2 ON (SUBQUERY3_t2 . `col_varchar_key` = SUBQUERY3_t1 . `col_varchar_key` ) )
+WHERE SUBQUERY3_t1 . `col_int_key` > table3 . `pk` AND SUBQUERY3_t1 . `pk` != table3 . `pk` ) ) AND ( table1 . `pk` > 116 AND table1 . `pk` < ( 116 + 175 ) OR table1 . `pk` IN (251) ) OR table1 . `col_int_nokey` = table1 . `col_int_nokey`
+GROUP BY field4, field6, field9, field10
+HAVING field10 = 'c'
+;
+
+drop table t1,t2,t3,t4,t5;
+set @@optimizer_switch= default;
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-07-10 10:37:30 +0000
+++ b/sql/item.cc 2010-07-29 11:13:48 +0000
@@ -6966,6 +6966,14 @@
}
+Item* Item_cache_wrapper::get_tmp_table_item(THD *thd_arg)
+{
+ if (!orig_item->with_sum_func && !orig_item->const_item())
+ return new Item_field(result_field);
+ return copy_or_same(thd_arg);
+}
+
+
/**
Prepare referenced field then call usual Item_direct_ref::fix_fields .
=== modified file 'sql/item.h'
--- a/sql/item.h 2010-07-10 10:37:30 +0000
+++ b/sql/item.h 2010-07-29 11:13:48 +0000
@@ -2624,6 +2624,7 @@
{
save_val(result_field);
}
+ Item* get_tmp_table_item(THD *thd_arg);
/* Following methods make this item transparent as much as possible */
Re: [Maria-developers] [Commits] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (igor:2869) Bug#52005
by Sergey Petrunya 27 Jul '10
Hello Igor,
Ok to push. I'm sorry for the delay.
On Sun, Jul 25, 2010 at 10:50:03PM -0700, Igor Babaev wrote:
> #At lp:maria based on revid:monty@askmonty.org-20100615220051-2xp3g51fysxle1r1
>
> 2869 Igor Babaev 2010-07-25
> Fixed bug #52005.
> Corrected coding for Warshall's algorithm.
> modified:
> mysql-test/r/join_outer.result
> mysql-test/t/join_outer.test
> sql/sql_select.cc
>
> === modified file 'mysql-test/r/join_outer.result'
> --- a/mysql-test/r/join_outer.result 2010-03-19 06:21:37 +0000
> +++ b/mysql-test/r/join_outer.result 2010-07-26 05:49:51 +0000
> @@ -1308,4 +1308,63 @@ WHERE (COALESCE(t1.f1, t2.f1), f3) IN ((
> f1 f2 f3 f1 f2
> 1 NULL 3 NULL NULL
> DROP TABLE t1, t2;
> +#
> +# Bug#46091 STRAIGHT_JOIN + RIGHT JOIN returns different result
> +#
> +CREATE TABLE t1 (f1 INT NOT NULL);
> +INSERT INTO t1 VALUES (9),(0);
> +CREATE TABLE t2 (f1 INT NOT NULL);
> +INSERT INTO t2 VALUES
> +(5),(3),(0),(3),(1),(0),(1),(7),(1),(0),(0),(8),(4),(9),(0),(2),(0),(8),(5),(1);
> +SELECT STRAIGHT_JOIN COUNT(*) FROM t1 TA1
> +RIGHT JOIN t2 TA2 JOIN t2 TA3 ON TA2.f1 ON TA3.f1;
> +COUNT(*)
> +476
> +EXPLAIN SELECT STRAIGHT_JOIN COUNT(*) FROM t1 TA1
> +RIGHT JOIN t2 TA2 JOIN t2 TA3 ON TA2.f1 ON TA3.f1;
> +id select_type table type possible_keys key key_len ref rows Extra
> +1 SIMPLE TA2 ALL NULL NULL NULL NULL 20 Using where
> +1 SIMPLE TA3 ALL NULL NULL NULL NULL 20 Using join buffer
> +1 SIMPLE TA1 ALL NULL NULL NULL NULL 2
> +DROP TABLE t1, t2;
> +#
> +# Bug#48971 Segfault in add_found_match_trig_cond () at sql_select.cc:5990
> +#
> +CREATE TABLE t1(f1 INT, PRIMARY KEY (f1));
> +INSERT INTO t1 VALUES (1),(2);
> +EXPLAIN EXTENDED SELECT STRAIGHT_JOIN jt1.f1 FROM t1 AS jt1
> +LEFT JOIN t1 AS jt2
> +RIGHT JOIN t1 AS jt3
> +JOIN t1 AS jt4 ON 1
> +LEFT JOIN t1 AS jt5 ON 1
> +ON 1
> +RIGHT JOIN t1 AS jt6 ON jt6.f1
> +ON 1;
> +id select_type table type possible_keys key key_len ref rows filtered Extra
> +1 SIMPLE jt1 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt6 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt3 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt4 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt5 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt2 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +Warnings:
> +Note 1003 select straight_join `test`.`jt1`.`f1` AS `f1` from `test`.`t1` `jt1` left join (`test`.`t1` `jt6` left join (`test`.`t1` `jt3` join `test`.`t1` `jt4` left join `test`.`t1` `jt5` on(1) left join `test`.`t1` `jt2` on(1)) on((`test`.`jt6`.`f1` and 1))) on(1) where 1
> +EXPLAIN EXTENDED SELECT STRAIGHT_JOIN jt1.f1 FROM t1 AS jt1
> +RIGHT JOIN t1 AS jt2
> +RIGHT JOIN t1 AS jt3
> +JOIN t1 AS jt4 ON 1
> +LEFT JOIN t1 AS jt5 ON 1
> +ON 1
> +RIGHT JOIN t1 AS jt6 ON jt6.f1
> +ON 1;
> +id select_type table type possible_keys key key_len ref rows filtered Extra
> +1 SIMPLE jt6 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt3 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt4 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt5 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt2 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +1 SIMPLE jt1 index NULL PRIMARY 4 NULL 2 100.00 Using index
> +Warnings:
> +Note 1003 select straight_join `test`.`jt1`.`f1` AS `f1` from `test`.`t1` `jt6` left join (`test`.`t1` `jt3` join `test`.`t1` `jt4` left join `test`.`t1` `jt5` on(1) left join `test`.`t1` `jt2` on(1)) on((`test`.`jt6`.`f1` and 1)) left join `test`.`t1` `jt1` on(1) where 1
> +DROP TABLE t1;
> End of 5.1 tests
>
> === modified file 'mysql-test/t/join_outer.test'
> --- a/mysql-test/t/join_outer.test 2010-03-19 06:21:37 +0000
> +++ b/mysql-test/t/join_outer.test 2010-07-26 05:49:51 +0000
> @@ -913,4 +913,48 @@ WHERE (COALESCE(t1.f1, t2.f1), f3) IN ((
>
> DROP TABLE t1, t2;
>
> +--echo #
> +--echo # Bug#46091 STRAIGHT_JOIN + RIGHT JOIN returns different result
> +--echo #
> +CREATE TABLE t1 (f1 INT NOT NULL);
> +INSERT INTO t1 VALUES (9),(0);
> +
> +CREATE TABLE t2 (f1 INT NOT NULL);
> +INSERT INTO t2 VALUES
> +(5),(3),(0),(3),(1),(0),(1),(7),(1),(0),(0),(8),(4),(9),(0),(2),(0),(8),(5),(1);
> +
> +SELECT STRAIGHT_JOIN COUNT(*) FROM t1 TA1
> +RIGHT JOIN t2 TA2 JOIN t2 TA3 ON TA2.f1 ON TA3.f1;
> +
> +EXPLAIN SELECT STRAIGHT_JOIN COUNT(*) FROM t1 TA1
> +RIGHT JOIN t2 TA2 JOIN t2 TA3 ON TA2.f1 ON TA3.f1;
> +
> +DROP TABLE t1, t2;
> +
> +--echo #
> +--echo # Bug#48971 Segfault in add_found_match_trig_cond () at sql_select.cc:5990
> +--echo #
> +CREATE TABLE t1(f1 INT, PRIMARY KEY (f1));
> +INSERT INTO t1 VALUES (1),(2);
> +
> +EXPLAIN EXTENDED SELECT STRAIGHT_JOIN jt1.f1 FROM t1 AS jt1
> + LEFT JOIN t1 AS jt2
> + RIGHT JOIN t1 AS jt3
> + JOIN t1 AS jt4 ON 1
> + LEFT JOIN t1 AS jt5 ON 1
> + ON 1
> + RIGHT JOIN t1 AS jt6 ON jt6.f1
> + ON 1;
> +
> +EXPLAIN EXTENDED SELECT STRAIGHT_JOIN jt1.f1 FROM t1 AS jt1
> + RIGHT JOIN t1 AS jt2
> + RIGHT JOIN t1 AS jt3
> + JOIN t1 AS jt4 ON 1
> + LEFT JOIN t1 AS jt5 ON 1
> + ON 1
> + RIGHT JOIN t1 AS jt6 ON jt6.f1
> + ON 1;
> +
> +DROP TABLE t1;
> +
> --echo End of 5.1 tests
>
> === modified file 'sql/sql_select.cc'
> --- a/sql/sql_select.cc 2010-05-26 18:55:40 +0000
> +++ b/sql/sql_select.cc 2010-07-26 05:49:51 +0000
> @@ -2717,15 +2717,29 @@ make_join_statistics(JOIN *join, TABLE_L
> as well as allow us to catch illegal cross references/
> Warshall's algorithm is used to build the transitive closure.
> As we use bitmaps to represent the relation the complexity
> - of the algorithm is O((number of tables)^2).
> + of the algorithm is O((number of tables)^2).
> +
> + The classic form of the Warshall's algorithm would look like:
> + for (i= 0; i < table_count; i++)
> + {
> + for (j= 0; j < table_count; j++)
> + {
> + for (k= 0; k < table_count; k++)
> + {
> + if (bitmap_is_set(stat[j], i) && bitmap_is_set(stat[i], k)
> + bitmap_set_bit(stat[j], k);
> + }
> + }
> + }
> */
> for (i= 0, s= stat ; i < table_count ; i++, s++)
> {
> - for (uint j= 0 ; j < table_count ; j++)
> + table= s->table;
> + JOIN_TAB *t;
> + for (uint j= 0, t= stat ; j < table_count ; j++, t++)
> {
> - table= stat[j].table;
> - if (s->dependent & table->map)
> - s->dependent |= table->reginfo.join_tab->dependent;
> + if (t->dependent & table->map)
> + t->dependent |= table->reginfo.join_tab->dependent;
> }
> if (outer_join & s->table->map)
> s->table->maybe_null= 1;
> @@ -8784,6 +8798,7 @@ simplify_joins(JOIN *join, List<TABLE_LI
> NESTED_JOIN *nested_join;
> TABLE_LIST *prev_table= 0;
> List_iterator<TABLE_LIST> li(*join_list);
> + bool straight_join= test(join->select_options & SELECT_STRAIGHT_JOIN);
> DBUG_ENTER("simplify_joins");
>
> /*
> @@ -8896,7 +8911,7 @@ simplify_joins(JOIN *join, List<TABLE_LI
> if (prev_table)
> {
> /* The order of tables is reverse: prev_table follows table */
> - if (prev_table->straight)
> + if (prev_table->straight || straight_join)
> prev_table->dep_tables|= used_tables;
> if (prev_table->on_expr)
> {
>
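For reference, a small stand-alone sketch (not the server code; table_map, the
helper name and the 3-table example are made up for illustration) of the
bitmap-based transitive closure the corrected loop computes. If table j
depends on table i, then j absorbs everything i depends on:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t table_map;               /* one bit per table            */

static void transitive_closure(table_map *dependent, unsigned table_count)
{
  for (unsigned i= 0; i < table_count; i++)
    for (unsigned j= 0; j < table_count; j++)
      if (dependent[j] & (table_map(1) << i))  /* j depends on i ...      */
        dependent[j]|= dependent[i];           /* ... so absorb i's deps  */
}

int main()
{
  /* t0 depends on t1, t1 depends on t2: the closure must add t2 to t0 */
  table_map dep[3]= { table_map(1) << 1, table_map(1) << 2, 0 };
  transitive_closure(dep, 3);
  printf("dep[0]=0x%llx\n", (unsigned long long) dep[0]);  /* prints 0x6 */
  return 0;
}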
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
[Maria-developers] WL#127 New (by Sergei): generalize mtr to support per-suite extensions
by worklog-noreply@askmonty.org 26 Jul '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: generalize mtr to support per-suite extensions
CREATION DATE..: Mon, 26 Jul 2010, 08:45
SUPERVISOR.....: Sergei
IMPLEMENTOR....: Sergei
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 127 (http://askmonty.org/worklog/?tid=127)
VERSION........: Server-5.2
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 16 (hours remain)
ORIG. ESTIMATE.: 16
PROGRESS NOTES:
DESCRIPTION:
To test the sphinxse we need to start the sphinx daemon and preload the
data. Obviously we don't want any sphinxse-specific code in mysql-test-run,
so we need a generic way for a suite to hook startup/shutdown code into the mtr.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3/ branch (timour:2806)
by timour@askmonty.org 23 Jul '10
#At file:///home/tsk/mprog/src/5.3/ based on revid:timour@askmonty.org-20100716110215-toh8erf6p93d1n6i
2806 timour(a)askmonty.org 2010-07-23
Removed dead code that was made obsolete by the introduction of
check_join_cache_usage() by the change:
Revno: 2793
Revision Id: igor(a)askmonty.org-20091221022615-kx5ieiu0okmiupuc
Timestamp: Sun 2009-12-20 18:26:15 -0800
Backport into MariaDB-5.2 the following:
WL#2771 "Block Nested Loop Join and Batched Key Access Join"
modified:
sql/sql_select.cc
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-07-15 13:59:10 +0000
+++ b/sql/sql_select.cc 2010-07-23 08:25:00 +0000
@@ -7560,7 +7560,6 @@ make_join_readinfo(JOIN *join, ulonglong
{
uint i;
bool statistics= test(!(join->select_options & SELECT_DESCRIBE));
- bool ordered_set= 0;
bool sorted= 1;
uint first_sjm_table= MAX_TABLES;
uint last_sjm_table= MAX_TABLES;
@@ -7580,21 +7579,6 @@ make_join_readinfo(JOIN *join, ulonglong
tab->read_record.file=table->file;
tab->read_record.unlock_row= rr_unlock_row;
tab->next_select=sub_select; /* normal select */
-
- /*
- Determine if the set is already ordered for ORDER BY, so it can
- disable join cache because it will change the ordering of the results.
- Code handles sort table that is at any location (not only first after
- the const tables) despite the fact that it's currently prohibited.
- We must disable join cache if the first non-const table alone is
- ordered. If there is a temp table the ordering is done as a last
- operation and doesn't prevent join cache usage.
- */
- if (!ordered_set && !join->need_tmp &&
- (table == join->sort_by_table ||
- (join->sort_by_table == (TABLE *) 1 && i != join->const_tables)))
- ordered_set= 1;
-
tab->sorted= sorted;
sorted= 0; // only first must be sorted
if (tab->loosescan_match_tab)
22 Jul '10
Hello,
First off, is this the right list to post questions about MariaDB's
source code? If it is not, I apologize, and can somebody please direct
me to the right alias? I could not find anything on MariaDB's website.
I have noticed the following behavior in MariaDB 5.1.47 with our
storage engine that is different from MySQL: I see many cases where
handler::start_bulk_insert is called before insertions, but
handler::end_bulk_insert is NOT called. In MySQL 5.1.46,
handler::end_bulk_insert is always called if there was a call to
handler::start_bulk_insert.
The commands run are:
MariaDB [test]> create table ttt(a int);
MariaDB [test]> insert into ttt values (1),(2),(3);
Query OK, 3 rows affected (0.01 sec)
Records: 3 Duplicates: 0 Warnings: 0
Is there a reason for this? Does some flag need to be exposed for this
function to be called?
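For context, here is a minimal stand-alone sketch (not taken from any real
engine; ExampleEngine and its members are hypothetical, only the two method
names mirror the handler API calls above) of why the pairing matters: work
deferred in start_bulk_insert() is only applied in end_bulk_insert(), so a
start without a matching end leaves it pending.

#include <stdio.h>

struct ExampleEngine
{
  bool bulk_active;
  int  pending_rows;
  ExampleEngine() : bulk_active(false), pending_rows(0) {}

  void start_bulk_insert(long long estimated_rows)
  {
    (void) estimated_rows;
    bulk_active= true;                  /* defer index maintenance        */
  }
  void write_row()
  {
    if (bulk_active)
      pending_rows++;                   /* queue work instead of doing it */
  }
  int end_bulk_insert()
  {
    if (!bulk_active)
      return 0;
    printf("flushing %d deferred rows\n", pending_rows);
    bulk_active= false;
    pending_rows= 0;
    return 0;
  }
};

int main()
{
  ExampleEngine e;
  e.start_bulk_insert(3);
  e.write_row(); e.write_row(); e.write_row();
  /* If end_bulk_insert() is never called, the deferred work above is
     never applied, which is why the missing call is a problem.        */
  return e.end_bulk_insert();
}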
Thanks
-Zardosht
Re: [Maria-developers] [Fwd: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2 branch (igor:2821) Bug#604549]
by Sergey Petrunya 22 Jul '10
Hello Igor,
Please find the feedback below.
On Mon, Jul 12, 2010 at 07:08:34PM -0700, Igor Babaev wrote:
> Please review this patch for the 5.2 tree.
>
> Regards,
> Igor.
>
> -------- Original Message --------
> Subject: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2
> branch (igor:2821) Bug#604549
> Date: Mon, 12 Jul 2010 18:23:26 -0700 (PDT)
> From: Igor Babaev <igor(a)askmonty.org>
> Reply-To: maria-developers(a)lists.launchpad.net
> To: commits(a)mariadb.org
>
> #At lp:maria/5.2 based on
> revid:knielsen@knielsen-hq.org-20100709120309-xzhk02q8coq7m6tl
>
> 2821 Igor Babaev 2010-07-12
> Fixed bug #604549.
> There was no error thrown when creating a table with a virtual table
> computed by an expression returning a row.
> This caused a crash when inserting into the table.
>
> Removed periods at the end of the error messages for virtual columns.
> Adjusted output in test result files accordingly.
Periods at the end of error messages have been there all along. Why do
we suddenly decide to remove them now?
> === modified file 'mysql-test/r/plugin.result'
> --- a/mysql-test/r/plugin.result 2010-04-30 20:04:35 +0000
> +++ b/mysql-test/r/plugin.result 2010-07-13 01:23:07 +0000
> @@ -75,9 +75,9 @@ SET SQL_MODE='IGNORE_BAD_TABLE_OPTIONS';
> #illegal value fixed
> CREATE TABLE t1 (a int) ENGINE=example ULL=10000000000000000000
> one_or_two='ttt' YESNO=SSS;
> Warnings:
> -Warning 1651 Incorrect value '10000000000000000000' for option 'ULL'
> -Warning 1651 Incorrect value 'ttt' for option 'one_or_two'
> -Warning 1651 Incorrect value 'SSS' for option 'YESNO'
> +Warning 1652 Incorrect value '10000000000000000000' for option 'ULL'
> +Warning 1652 Incorrect value 'ttt' for option 'one_or_two'
> +Warning 1652 Incorrect value 'SSS' for option 'YESNO'
Why did the warning code change? Is this intentional?
> === modified file 'sql/share/errmsg.txt'
> --- a/sql/share/errmsg.txt 2010-06-01 19:52:20 +0000
> +++ b/sql/share/errmsg.txt 2010-07-13 01:23:07 +0000
> @@ -6211,28 +6211,31 @@ ER_VCOL_BASED_ON_VCOL
> eng "A computed column cannot be based on a computed column"
>
> ER_VIRTUAL_COLUMN_FUNCTION_IS_NOT_ALLOWED
> - eng "Function or expression is not allowed for column '%s'."
> + eng "Function or expression is not allowed for column '%s'"
>
> ER_DATA_CONVERSION_ERROR_FOR_VIRTUAL_COLUMN
> - eng "Generated value for computed column '%s' cannot be
> converted to type '%s'."
> + eng "Generated value for computed column '%s' cannot be
> converted to type '%s'"
>
> ER_PRIMARY_KEY_BASED_ON_VIRTUAL_COLUMN
> - eng "Primary key cannot be defined upon a computed column."
> + eng "Primary key cannot be defined upon a computed column"
>
> ER_KEY_BASED_ON_GENERATED_VIRTUAL_COLUMN
> - eng "Key/Index cannot be defined on a non-stored computed column."
> + eng "Key/Index cannot be defined on a non-stored computed column"
>
> ER_WRONG_FK_OPTION_FOR_VIRTUAL_COLUMN
> - eng "Cannot define foreign key with %s clause on a computed
> column."
> + eng "Cannot define foreign key with %s clause on a computed column"
>
> ER_WARNING_NON_DEFAULT_VALUE_FOR_VIRTUAL_COLUMN
> - eng "The value specified for computed column '%s' in table '%s'
> ignored."
> + eng "The value specified for computed column '%s' in table '%s'
> ignored"
>
> ER_UNSUPPORTED_ACTION_ON_VIRTUAL_COLUMN
> - eng "'%s' is not yet supported for computed columns."
> + eng "'%s' is not yet supported for computed columns"
>
> ER_CONST_EXPR_IN_VCOL
> - eng "Constant expression in computed column function is not
> allowed."
> + eng "Constant expression in computed column function is not
> allowed"
> +
> +ER_ROW_EXPR_FOR_VCOL
> + eng "Expression for computed column cannot return a row"
>
When one sees this pair of codes ER_CONST_EXPR_IN_VCOL and ER_ROW_EXPR_FOR_VCOL,
one can't help asking whether those are the only disallowed expressions,
and if not, whether we have error codes for vcol expressions with
- user variables
- subqueries
- SP calls
- etc, etc.
Do we handle such cases at all?
> ER_DEBUG_SYNC_TIMEOUT
> eng "debug sync point wait timed out"
>
> === modified file 'sql/table.cc'
> --- a/sql/table.cc 2010-06-05 14:53:36 +0000
> +++ b/sql/table.cc 2010-07-13 01:23:07 +0000
> @@ -1859,6 +1859,14 @@ bool fix_vcol_expr(THD *thd,
> goto end;
> }
> thd->where= save_where;
> +#if 0
> +#else
> + if (unlikely(func_expr->result_type() == ROW_RESULT))
> + {
> + my_error(ER_ROW_EXPR_FOR_VCOL, MYF(0));
> + goto end;
> + }
> +#endif
Please remove #if/#else.
> #ifdef PARANOID
> /*
> Walk through the Item tree checking if all items are valid
>
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
Antony,
Hi there! How goes? I have a question pertaining to bug 571200. I have
hand-coded the fix from Federated as shown on
http://lists.mysql.com/commits/102419. Obviously, FederatedX has changed
enough that there's a fair amount of work I had to do (see my branch at
bzr+ssh://bazaar.launchpad.net/~capttofu/maria_bug_571200). Most of it I
think I have a good solution to. The one thing remaining is how to handle
anything that pertains to getting or setting the "result->data_cursor"
value, as you'll see in that patch. Since "result" is now a
FEDERATEDX_IO_RESULT, it doesn't have that structure member;
FEDERATEDX_IO_RESULT for the federatedx_io_mysql class is a MYSQL_RES, but
trying to access results->data_cursor results in errors such as:
ha_federatedx.cc: In member function ‘int
ha_federatedx::read_next(uchar*, FEDERATEDX_IO_RESULT*)’:
ha_federatedx.cc:2888: error: invalid use of incomplete type ‘struct
st_federatedx_result’
ha_federatedx.h:127: error: forward declaration of ‘struct
st_federatedx_result’
So, the solution I think is to have virtual methods in the
federatedx_io class called "get_data_cursor_pos" and
"set_data_cursor_pos" that, within the federatedx_io_mysql class, access
result->data_cursor. For other driver classes it'll require some
thinking, but for now this would get the mysql driver working.
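A rough sketch of that idea (a sketch only, not the actual branch; treating
the cursor position as an opaque void* is an assumption, as is MYSQL_ROWS
being the type behind data_cursor) could look like this:

#include <mysql.h>                      /* MYSQL_RES, MYSQL_ROWS          */

typedef struct st_federatedx_result FEDERATEDX_IO_RESULT;   /* opaque     */

class federatedx_io
{
public:
  virtual ~federatedx_io() {}
  virtual void *get_data_cursor_pos(FEDERATEDX_IO_RESULT *io_result)= 0;
  virtual void  set_data_cursor_pos(FEDERATEDX_IO_RESULT *io_result,
                                    void *pos)= 0;
};

class federatedx_io_mysql : public federatedx_io
{
public:
  void *get_data_cursor_pos(FEDERATEDX_IO_RESULT *io_result)
  {
    MYSQL_RES *result= (MYSQL_RES *) io_result;
    return result->data_cursor;         /* assumed to be a MYSQL_ROWS*    */
  }
  void set_data_cursor_pos(FEDERATEDX_IO_RESULT *io_result, void *pos)
  {
    MYSQL_RES *result= (MYSQL_RES *) io_result;
    result->data_cursor= (MYSQL_ROWS *) pos;
  }
};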
What are your thoughts on this and what I have thus far?
Thanks, hope all is going well!
regards,
Patrick
Re: [Maria-developers] [Fwd: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2 branch (igor:2827) Bug#607177]
by Sergey Petrunya 21 Jul '10
Hello Igor,
Ok to push.
On Tue, Jul 20, 2010 at 10:01:30PM -0700, Igor Babaev wrote:
> Sergey,
>
> Please review this trivial patch for the 5.2 tree.
>
> Regards,
> Igor.
>
> -------- Original Message --------
> Subject: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2
> branch (igor:2827) Bug#607177
> Date: Tue, 20 Jul 2010 22:00:00 -0700 (PDT)
> From: Igor Babaev <igor(a)askmonty.org>
> Reply-To: maria-developers(a)lists.launchpad.net
> To: commits(a)mariadb.org
>
> #At lp:maria/5.2 based on
> revid:igor@askmonty.org-20100717195808-mvh782jvt6c32u2d
>
> 2827 Igor Babaev 2010-07-20
> Fixed bug #607177.
> Due to an invalid check for NULL of the second argument of the
> Item_func_round items performed in the code of
> Item_func_round::real_op
> the function ROUND sometimes could return wrong results.
> modified:
> mysql-test/suite/vcol/r/vcol_misc.result
> mysql-test/suite/vcol/t/vcol_misc.test
> sql/item_func.cc
>
> === modified file 'mysql-test/suite/vcol/r/vcol_misc.result'
> --- a/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-17 19:58:08 +0000
> +++ b/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-21 04:59:47 +0000
> @@ -87,3 +87,23 @@ a v
> 2002-02-15 00:00:00 0
> 2000-10-15 00:00:00 1
> DROP TABLE t1, t2;
> +CREATE TABLE t1 (p int, a double NOT NULL, v double AS (ROUND(a,p))
> VIRTUAL);
> +INSERT INTO t1 VALUES (0,1,0);
> +Warnings:
> +Warning 1645 The value specified for computed column 'v' in table 't1'
> ignored
> +INSERT INTO t1 VALUES (NULL,0,0);
> +Warnings:
> +Warning 1645 The value specified for computed column 'v' in table 't1'
> ignored
> +SELECT a, p, v, ROUND(a,p), ROUND(a,p+NULL) FROM t1;
> +a p v ROUND(a,p) ROUND(a,p+NULL)
> +1 0 1 1 NULL
> +0 NULL NULL NULL NULL
> +DROP TABLE t1;
> +CREATE TABLE t1 (p int, a double NOT NULL);
> +INSERT INTO t1(p,a) VALUES (0,1);
> +INSERT INTO t1(p,a) VALUES (NULL,0);
> +SELECT a, p, ROUND(a,p), ROUND(a,p+NULL) FROM t1;
> +a p ROUND(a,p) ROUND(a,p+NULL)
> +1 0 1 NULL
> +0 NULL NULL NULL
> +DROP TABLE t1;
>
> === modified file 'mysql-test/suite/vcol/t/vcol_misc.test'
> --- a/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-17 19:58:08 +0000
> +++ b/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-21 04:59:47 +0000
> @@ -87,3 +87,19 @@ INSERT INTO t2(a) VALUES ('2000-10-15');
> SELECT * FROM t2;
>
> DROP TABLE t1, t2;
> +
> +#
> +# Bug#607177: ROUND function in the expression for a virtual function
> +#
> +
> +CREATE TABLE t1 (p int, a double NOT NULL, v double AS (ROUND(a,p))
> VIRTUAL);
> +INSERT INTO t1 VALUES (0,1,0);
> +INSERT INTO t1 VALUES (NULL,0,0);
> +SELECT a, p, v, ROUND(a,p), ROUND(a,p+NULL) FROM t1;
> +DROP TABLE t1;
> +
> +CREATE TABLE t1 (p int, a double NOT NULL);
> +INSERT INTO t1(p,a) VALUES (0,1);
> +INSERT INTO t1(p,a) VALUES (NULL,0);
> +SELECT a, p, ROUND(a,p), ROUND(a,p+NULL) FROM t1;
> +DROP TABLE t1;
>
> === modified file 'sql/item_func.cc'
> --- a/sql/item_func.cc 2010-06-01 19:52:20 +0000
> +++ b/sql/item_func.cc 2010-07-21 04:59:47 +0000
> @@ -2040,10 +2040,12 @@ double Item_func_round::real_op()
> {
> double value= args[0]->val_real();
>
> - if (!(null_value= args[0]->null_value || args[1]->null_value))
> - return my_double_round(value, args[1]->val_int(),
> args[1]->unsigned_flag,
> - truncate);
> -
> + if (!(null_value= args[0]->null_value))
> + {
> + longlong dec= args[1]->val_int();
> + if (!(null_value= args[1]->null_value))
> + return my_double_round(value, dec, args[1]->unsigned_flag, truncate);
> + }
> return 0.0;
> }
>
>
> _______________________________________________
> commits mailing list
> commits(a)mariadb.org
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
--
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
Re: [Maria-developers] [Fwd: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2 branch (igor:2827) Bug#607566]
by Sergey Petrunya 20 Jul '10
Hello Igor,
Ok to push.
On Mon, Jul 19, 2010 at 10:43:18PM -0700, Igor Babaev wrote:
> Sergey,
>
> Please review this patch for the 5.2 tree.
>
> Regards,
> Igor.
>
>
> -------- Original Message --------
> Subject: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2
> branch (igor:2827) Bug#607566
> Date: Mon, 19 Jul 2010 22:41:37 -0700 (PDT)
> From: Igor Babaev <igor(a)askmonty.org>
> Reply-To: maria-developers(a)lists.launchpad.net
> To: commits(a)mariadb.org
>
> #At lp:maria/5.2 based on
> revid:igor@askmonty.org-20100717195808-mvh782jvt6c32u2d
>
> 2827 Igor Babaev 2010-07-19
> Fixed bug #607566.
> For queries with order by clauses that employed filesort usage of
> virtual column references in select lists could trigger assertion
> failures. It happened because a wrong vcol_set bitmap was used for
> filesort. It turned out that filesort required its own vcol_set
> bitmap.
>
> Made management of the vcol_set bitmaps similar to the management
> of the read_set and write_set bitmaps.
> modified:
> mysql-test/suite/vcol/r/vcol_misc.result
> mysql-test/suite/vcol/t/vcol_misc.test
> sql/field.cc
> sql/filesort.cc
> sql/sql_insert.cc
> sql/sql_select.cc
> sql/table.cc
> sql/table.h
>
> === modified file 'mysql-test/suite/vcol/r/vcol_misc.result'
> --- a/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-17 19:58:08 +0000
> +++ b/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-20 05:41:24 +0000
> @@ -87,3 +87,13 @@ a v
> 2002-02-15 00:00:00 0
> 2000-10-15 00:00:00 1
> DROP TABLE t1, t2;
> +CREATE TABLE t1 (
> +a char(255), b char(255), c char(255), d char(255),
> +v char(255) AS (CONCAT(c,d) ) VIRTUAL
> +);
> +INSERT INTO t1(a,b,c,d) VALUES ('w','x','y','z'), ('W','X','Y','Z');
> +SELECT v FROM t1 ORDER BY CONCAT(a,b);
> +v
> +yz
> +YZ
> +DROP TABLE t1;
>
> === modified file 'mysql-test/suite/vcol/t/vcol_misc.test'
> --- a/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-17 19:58:08 +0000
> +++ b/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-20 05:41:24 +0000
> @@ -87,3 +87,18 @@ INSERT INTO t2(a) VALUES ('2000-10-15');
> SELECT * FROM t2;
>
> DROP TABLE t1, t2;
> +
> +#
> +# Bug#607566: Virtual column in the select list of SELECT with ORDER BY
> +#
> +
> +CREATE TABLE t1 (
> + a char(255), b char(255), c char(255), d char(255),
> + v char(255) AS (CONCAT(c,d) ) VIRTUAL
> +);
> +
> +INSERT INTO t1(a,b,c,d) VALUES ('w','x','y','z'), ('W','X','Y','Z');
> +
> +SELECT v FROM t1 ORDER BY CONCAT(a,b);
> +
> +DROP TABLE t1;
>
> === modified file 'sql/field.cc'
> --- a/sql/field.cc 2010-06-01 19:52:20 +0000
> +++ b/sql/field.cc 2010-07-20 05:41:24 +0000
> @@ -57,7 +57,7 @@ const char field_separator=',';
> ((ulong) ((LL(1) << min(arg, 4) * 8) - LL(1)))
>
>  #define ASSERT_COLUMN_MARKED_FOR_READ DBUG_ASSERT(!table || (!table->read_set || bitmap_is_set(table->read_set, field_index)))
> -#define ASSERT_COLUMN_MARKED_FOR_WRITE_OR_COMPUTED DBUG_ASSERT(!table || (!table->write_set || bitmap_is_set(table->write_set, field_index) || bitmap_is_set(&table->vcol_set, field_index)))
> +#define ASSERT_COLUMN_MARKED_FOR_WRITE_OR_COMPUTED DBUG_ASSERT(!table || (!table->write_set || bitmap_is_set(table->write_set, field_index) || bitmap_is_set(table->vcol_set, field_index)))
>
> /*
> Rules for merging different types of fields in UNION
>
> === modified file 'sql/filesort.cc'
> --- a/sql/filesort.cc 2010-07-17 19:58:08 +0000
> +++ b/sql/filesort.cc 2010-07-20 05:41:24 +0000
> @@ -515,7 +515,7 @@ static ha_rows find_all_keys(SORTPARAM *
> THD *thd= current_thd;
> volatile THD::killed_state *killed= &thd->killed;
> handler *file;
> - MY_BITMAP *save_read_set, *save_write_set;
> + MY_BITMAP *save_read_set, *save_write_set, *save_vcol_set;
> DBUG_ENTER("find_all_keys");
> DBUG_PRINT("info",("using: %s",
> (select ? select->quick ? "ranges" : "where":
> @@ -552,6 +552,7 @@ static ha_rows find_all_keys(SORTPARAM *
> /* Remember original bitmaps */
> save_read_set= sort_form->read_set;
> save_write_set= sort_form->write_set;
> + save_vcol_set= sort_form->vcol_set;
> /* Set up temporary column read map for columns used by sort */
> bitmap_clear_all(&sort_form->tmp_set);
> /* Temporary set for register_used_fields and
> register_field_in_read_map */
> @@ -560,7 +561,8 @@ static ha_rows find_all_keys(SORTPARAM *
> if (select && select->cond)
> select->cond->walk(&Item::register_field_in_read_map, 1,
> (uchar*) sort_form);
> - sort_form->column_bitmaps_set(&sort_form->tmp_set, &sort_form->tmp_set);
> + sort_form->column_bitmaps_set(&sort_form->tmp_set, &sort_form->tmp_set,
> + &sort_form->tmp_set);
>
> for (;;)
> {
> @@ -643,7 +645,7 @@ static ha_rows find_all_keys(SORTPARAM *
> DBUG_RETURN(HA_POS_ERROR);
>
> /* Signal we should use orignal column read and write maps */
> - sort_form->column_bitmaps_set(save_read_set, save_write_set);
> +  sort_form->column_bitmaps_set(save_read_set, save_write_set, save_vcol_set);
>
> DBUG_PRINT("test",("error: %d indexpos: %d",error,indexpos));
> if (error != HA_ERR_END_OF_FILE)
>
> === modified file 'sql/sql_insert.cc'
> --- a/sql/sql_insert.cc 2010-07-15 23:51:05 +0000
> +++ b/sql/sql_insert.cc 2010-07-20 05:41:24 +0000
> @@ -2109,7 +2109,7 @@ TABLE *Delayed_insert::get_local_table(T
> copy= (TABLE*) client_thd->alloc(sizeof(*copy)+
> (share->fields+1)*sizeof(Field**)+
> share->reclength +
> - share->column_bitmap_size*2);
> + share->column_bitmap_size*3);
> if (!copy)
> goto error;
>
> @@ -2119,7 +2119,7 @@ TABLE *Delayed_insert::get_local_table(T
> /* Assign the pointers for the field pointers array and the record. */
> field= copy->field= (Field**) (copy + 1);
> bitmap= (uchar*) (field + share->fields + 1);
> - copy->record[0]= (bitmap + share->column_bitmap_size * 2);
> + copy->record[0]= (bitmap + share->column_bitmap_size*3);
> memcpy((char*) copy->record[0], (char*) table->record[0],
> share->reclength);
> /*
> Make a copy of all fields.
> @@ -2161,10 +2161,13 @@ TABLE *Delayed_insert::get_local_table(T
> copy->def_read_set.bitmap= (my_bitmap_map*) bitmap;
> copy->def_write_set.bitmap= ((my_bitmap_map*)
> (bitmap + share->column_bitmap_size));
> + copy->def_vcol_set.bitmap= ((my_bitmap_map*)
> + (bitmap + 2*share->column_bitmap_size));
> copy->tmp_set.bitmap= 0; // To catch errors
> - bzero((char*) bitmap, share->column_bitmap_size*2);
> + bzero((char*) bitmap, share->column_bitmap_size*3);
>    copy->read_set=  &copy->def_read_set;
>    copy->write_set= &copy->def_write_set;
> +  copy->vcol_set=  &copy->def_vcol_set;
>
> DBUG_RETURN(copy);
>
>
> === modified file 'sql/sql_select.cc'
> --- a/sql/sql_select.cc 2010-07-17 19:58:08 +0000
> +++ b/sql/sql_select.cc 2010-07-20 05:41:24 +0000
> @@ -5513,7 +5513,7 @@ static void calc_used_field_length(THD *
> {
> uint null_fields,blobs,fields,rec_length;
> Field **f_ptr,*field;
> - MY_BITMAP *read_set= join_tab->table->read_set;;
> + MY_BITMAP *read_set= join_tab->table->read_set;
>
> null_fields= blobs= fields= rec_length=0;
> for (f_ptr=join_tab->table->field ; (field= *f_ptr) ; f_ptr++)
> @@ -9877,11 +9877,11 @@ void setup_tmp_table_column_bitmaps(TABL
> uint field_count= table->s->fields;
> bitmap_init(&table->def_read_set, (my_bitmap_map*) bitmaps, field_count,
> FALSE);
> - bitmap_init(&table->tmp_set,
> + bitmap_init(&table->def_vcol_set,
> (my_bitmap_map*) (bitmaps+ bitmap_buffer_size(field_count)),
> field_count, FALSE);
> - bitmap_init(&table->vcol_set,
> -              (my_bitmap_map*) (bitmaps+ 2+bitmap_buffer_size(field_count)),
> +  bitmap_init(&table->tmp_set,
> +              (my_bitmap_map*) (bitmaps+ 2*bitmap_buffer_size(field_count)),
> field_count, FALSE);
>
> /* write_set and all_set are copies of read_set */
>
> === modified file 'sql/table.cc'
> --- a/sql/table.cc 2010-07-17 19:58:08 +0000
> +++ b/sql/table.cc 2010-07-20 05:41:24 +0000
> @@ -2343,9 +2343,9 @@ partititon_err:
> (my_bitmap_map*) bitmaps, share->fields, FALSE);
> bitmap_init(&outparam->def_write_set,
>              (my_bitmap_map*) (bitmaps+bitmap_size), share->fields, FALSE);
> -  bitmap_init(&outparam->tmp_set,
> +  bitmap_init(&outparam->def_vcol_set,
>              (my_bitmap_map*) (bitmaps+bitmap_size*2), share->fields, FALSE);
> -  bitmap_init(&outparam->vcol_set,
> +  bitmap_init(&outparam->tmp_set,
>              (my_bitmap_map*) (bitmaps+bitmap_size*3), share->fields, FALSE);
> outparam->default_column_bitmaps();
>
> @@ -4809,10 +4809,10 @@ void st_table::clear_column_bitmaps()
> Reset column read/write usage. It's identical to:
> bitmap_clear_all(&table->def_read_set);
> bitmap_clear_all(&table->def_write_set);
> + bitmap_clear_all(&table->def_vcol_set);
> */
> - bzero((char*) def_read_set.bitmap, s->column_bitmap_size*2);
> - bzero((char*) def_read_set.bitmap, s->column_bitmap_size*4);
> - column_bitmaps_set(&def_read_set, &def_write_set);
> + bzero((char*) def_read_set.bitmap, s->column_bitmap_size*3);
> + column_bitmaps_set(&def_read_set, &def_write_set, &def_vcol_set);
> }
>
>
> @@ -5085,7 +5085,7 @@ bool st_table::mark_virtual_col(Field *f
> {
> bool res;
> DBUG_ASSERT(field->vcol_info);
> - if (!(res= bitmap_fast_test_and_set(&vcol_set, field->field_index)))
> + if (!(res= bitmap_fast_test_and_set(vcol_set, field->field_index)))
> {
> Item *vcol_item= field->vcol_info->expr_item;
> DBUG_ASSERT(vcol_item);
> @@ -5464,7 +5464,7 @@ int update_virtual_fields(THD *thd, TABL
> vfield= (*vfield_ptr);
> DBUG_ASSERT(vfield->vcol_info && vfield->vcol_info->expr_item);
> /* Only update those fields that are marked in the vcol_set bitmap */
> - if (bitmap_is_set(&table->vcol_set, vfield->field_index) &&
> + if (bitmap_is_set(table->vcol_set, vfield->field_index) &&
> (for_write || !vfield->stored_in_db))
> {
> /* Compute the actual value of the virtual fields */
>
> === modified file 'sql/table.h'
> --- a/sql/table.h 2010-07-17 19:58:08 +0000
> +++ b/sql/table.h 2010-07-20 05:41:24 +0000
> @@ -719,9 +719,8 @@ struct st_table {
> const char *alias; /* alias or table name */
> uchar *null_flags;
> my_bitmap_map *bitmap_init_value;
> - MY_BITMAP def_read_set, def_write_set, tmp_set; /* containers */
> - MY_BITMAP vcol_set; /* set of used virtual columns */
> - MY_BITMAP *read_set, *write_set; /* Active column sets */
> + MY_BITMAP def_read_set, def_write_set, def_vcol_set, tmp_set;
> + MY_BITMAP *read_set, *write_set, *vcol_set; /* Active column sets */
> /*
> The ID of the query that opened and is using this table. Has different
> meanings depending on the table type.
> @@ -904,12 +903,30 @@ struct st_table {
> if (file)
> file->column_bitmaps_signal();
> }
> + inline void column_bitmaps_set(MY_BITMAP *read_set_arg,
> + MY_BITMAP *write_set_arg,
> + MY_BITMAP *vcol_set_arg)
> + {
> + read_set= read_set_arg;
> + write_set= write_set_arg;
> + vcol_set= vcol_set_arg;
> + if (file)
> + file->column_bitmaps_signal();
> + }
> inline void column_bitmaps_set_no_signal(MY_BITMAP *read_set_arg,
> MY_BITMAP *write_set_arg)
> {
> read_set= read_set_arg;
> write_set= write_set_arg;
> }
> + inline void column_bitmaps_set_no_signal(MY_BITMAP *read_set_arg,
> + MY_BITMAP *write_set_arg,
> + MY_BITMAP *vcol_set_arg)
> + {
> + read_set= read_set_arg;
> + write_set= write_set_arg;
> + vcol_set= vcol_set_arg;
> + }
> inline void use_all_columns()
> {
> column_bitmaps_set(&s->all_set, &s->all_set);
> @@ -918,6 +935,7 @@ struct st_table {
> {
> read_set= &def_read_set;
> write_set= &def_write_set;
> + vcol_set= &def_vcol_set;
> }
> /* Is table open or should be treated as such by name-locking? */
> inline bool is_name_opened() { return db_stat || open_placeholder; }
>
> _______________________________________________
> commits mailing list
> commits(a)mariadb.org
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
--
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
Hello,
I am a programmer at Comarch (Poland) and I would like to know what we
(Comarch) should do to release our custom storage engine, called CLDB, under
the GPLv2 licence. Our storage engine uses the MySQL/MariaDB custom storage
engine API and some sources from the "sql" directory for condition pushdown.
CLDB is a column-oriented database, currently single-threaded.
What documents do we need to fill in to obtain a GPL licence?
Should we share our code, and where?
Thank you for your answers.
-----------------------------------------------------------------------------
Mateusz Matan
IT Security R&D Department, C/C++ programmer
ComArch S.A., Al. Jana Pawła II 41d, 31-864 Kraków
tel: (+48 12) 684 8411
e-mail: Mateusz.Matan(a)comarch.pl
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3-mwl89/ branch (timour:2804)
by timour@askmonty.org 18 Jul '10
#At file:///home/tsk/mprog/src/5.3-mwl89/ based on revid:timour@askmonty.org-20100718114608-wiz9ji9z80pzjw2k
2804 timour(a)askmonty.org 2010-07-18
MWL#89: Cost-based choice between Materialization and IN->EXISTS transformation
Step 2 in the separation of the creation of IN->EXISTS equi-join conditions from
their injection. The goal of this separation is to make it possible for the
IN->EXISTS conditions to be used for cost estimation without actually modifying
the subquery.
This patch separates row_value_in_to_exists_transformer() into two methods:
- create_row_value_in_to_exists_cond(), and
- inject_row_value_in_to_exists_cond()
The patch performs minimal refactoring of the code so that it is easier to solve
problems resulting from the separation. There is a lot to be simplified in this
code, but this will be done separately.
modified:
sql/item_subselect.cc
sql/item_subselect.h
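For reference, the rewrite whose conditions are now built by
create_row_value_in_to_exists_cond() and attached by
inject_row_value_in_to_exists_cond(), shown in simplified form (the real code
also adds IS NULL guards, Item_is_not_null_test and trigger conditions for
NULL-aware semantics):

  -- row-value IN predicate
  SELECT * FROM t1 WHERE (t1.a, t1.b) IN (SELECT t2.x, t2.y FROM t2);
  -- is evaluated as a correlated EXISTS; the injected equi-join conditions go
  -- into the subquery's WHERE, or into HAVING when it groups or aggregates
  SELECT * FROM t1
  WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.x = t1.a AND t2.y = t1.b);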
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-07-18 11:46:08 +0000
+++ b/sql/item_subselect.cc 2010-07-18 12:59:24 +0000
@@ -1524,16 +1524,16 @@ Item_subselect::trans_res
Item_in_subselect::single_value_in_to_exists_transformer(JOIN * join,
Comp_creator *func)
{
- Item *where_term;
- Item *having_term;
+ Item *where_item;
+ Item *having_item;
Item_subselect::trans_res res;
res= create_single_value_in_to_exists_cond(join, func,
- &where_term, &having_term);
+ &where_item, &having_item);
if (res != RES_OK)
return res;
res= inject_single_value_in_to_exists_cond(join, func,
- where_term, having_term);
+ where_item, having_item);
return res;
}
@@ -1541,8 +1541,8 @@ Item_in_subselect::single_value_in_to_ex
Item_subselect::trans_res
Item_in_subselect::create_single_value_in_to_exists_cond(JOIN * join,
Comp_creator *func,
- Item **where_term,
- Item **having_term)
+ Item **where_item,
+ Item **having_item)
{
SELECT_LEX *select_lex= join->select_lex;
DBUG_ENTER("Item_in_subselect::create_single_value_in_to_exists_cond");
@@ -1569,8 +1569,8 @@ Item_in_subselect::create_single_value_i
if (item->fix_fields(thd, 0))
DBUG_RETURN(RES_ERROR);
- *having_term= item;
- *where_term= NULL;
+ *having_item= item;
+ *where_item= NULL;
}
else
{
@@ -1595,7 +1595,7 @@ Item_in_subselect::create_single_value_i
if (having->fix_fields(thd, 0))
DBUG_RETURN(RES_ERROR);
- *having_term= having;
+ *having_item= having;
item= new Item_cond_or(item,
new Item_func_isnull(orig_item));
@@ -1613,7 +1613,7 @@ Item_in_subselect::create_single_value_i
if (item->fix_fields(thd, 0))
DBUG_RETURN(RES_ERROR);
- *where_term= item;
+ *where_item= item;
}
else
{
@@ -1640,13 +1640,13 @@ Item_in_subselect::create_single_value_i
if (new_having->fix_fields(thd, 0))
DBUG_RETURN(RES_ERROR);
- *having_term= new_having;
- *where_term= NULL;
+ *having_item= new_having;
+ *where_item= NULL;
}
else
{
- *having_term= NULL;
- *where_term= (Item*) select_lex->item_list.head();
+ *having_item= NULL;
+ *where_item= (Item*) select_lex->item_list.head();
}
}
}
@@ -1659,8 +1659,8 @@ Item_in_subselect::create_single_value_i
Item_subselect::trans_res
Item_in_subselect::inject_single_value_in_to_exists_cond(JOIN * join,
Comp_creator *func,
- Item *where_term,
- Item *having_term)
+ Item *where_item,
+ Item *having_item)
{
SELECT_LEX *select_lex= join->select_lex;
bool fix_res;
@@ -1675,9 +1675,9 @@ Item_in_subselect::inject_single_value_i
we can assign select_lex->having here, and pass 0 as last
argument (reference) to fix_fields()
*/
- select_lex->having= join->having= and_items(join->having, having_term);
- if (join->having == having_term)
- having_term->name= (char*)in_having_cond;
+ select_lex->having= join->having= and_items(join->having, having_item);
+ if (join->having == having_item)
+ having_item->name= (char*)in_having_cond;
select_lex->having_fix_field= 1;
/*
we do not check join->having->fixed, because Item_and (from and_items)
@@ -1707,8 +1707,8 @@ Item_in_subselect::inject_single_value_i
we can assign select_lex->having here, and pass 0 as last
argument (reference) to fix_fields()
*/
- having_term->name= (char*)in_having_cond;
- select_lex->having= join->having= having_term;
+ having_item->name= (char*)in_having_cond;
+ select_lex->having= join->having= having_item;
select_lex->having_fix_field= 1;
/*
we do not check join->having->fixed, because Item_and (from
@@ -1726,14 +1726,14 @@ Item_in_subselect::inject_single_value_i
single_value_transformer but there is no corresponding action in
row_value_transformer?
*/
- where_term->name= (char *)in_additional_cond;
+ where_item->name= (char *)in_additional_cond;
/*
AND can't be changed during fix_fields()
we can assign select_lex->having here, and pass 0 as last
argument (reference) to fix_fields()
*/
- select_lex->where= join->conds= and_items(join->conds, where_term);
+ select_lex->where= join->conds= and_items(join->conds, where_item);
select_lex->where->top_level_item();
/*
we do not check join->conds->fixed, because Item_and can't be fixed
@@ -1746,8 +1746,8 @@ Item_in_subselect::inject_single_value_i
{
if (select_lex->master_unit()->is_union())
{
- having_term->name= (char*)in_having_cond;
- select_lex->having= join->having= having_term;
+ having_item->name= (char*)in_having_cond;
+ select_lex->having= join->having= having_item;
select_lex->having_fix_field= 1;
/*
@@ -1765,11 +1765,11 @@ Item_in_subselect::inject_single_value_i
// it is single select without tables => possible optimization
// remove the dependence mark since the item is moved to upper
// select and is not outer anymore.
- where_term->walk(&Item::remove_dependence_processor, 0,
+ where_item->walk(&Item::remove_dependence_processor, 0,
(uchar *) select_lex->outer_select());
- where_term= func->create(left_expr, where_term);
+ where_item= func->create(left_expr, where_item);
// fix_field of item will be done in time of substituting
- substitution= where_term;
+ substitution= where_item;
have_to_be_excluded= 1;
if (thd->lex->describe)
{
@@ -1866,20 +1866,37 @@ Item_in_subselect::row_value_transformer
add the equi-join and the "is null" to WHERE
add the is_not_null_test to HAVING
*/
-
Item_subselect::trans_res
Item_in_subselect::row_value_in_to_exists_transformer(JOIN * join)
{
+ Item *where_item;
+ Item *having_item;
+ Item_subselect::trans_res res;
+
+ res= create_row_value_in_to_exists_cond(join, &where_item, &having_item);
+ if (res != RES_OK)
+ return res;
+ res= inject_row_value_in_to_exists_cond(join, where_item, having_item);
+ return res;
+}
+
+
+Item_subselect::trans_res
+Item_in_subselect::create_row_value_in_to_exists_cond(JOIN * join,
+ Item **where_item,
+ Item **having_item)
+{
SELECT_LEX *select_lex= join->select_lex;
- Item *having_item= 0;
uint cols_num= left_expr->cols();
bool is_having_used= (join->having || select_lex->with_sum_func ||
select_lex->group_list.first ||
!select_lex->table_list.elements);
- DBUG_ENTER("Item_in_subselect::row_value_in_to_exists_transformer");
+ DBUG_ENTER("Item_in_subselect::create_row_value_in_to_exists_cond");
+
+ *where_item= NULL;
+ *having_item= NULL;
- select_lex->uncacheable|= UNCACHEABLE_DEPENDENT;
if (is_having_used)
{
/*
@@ -1899,6 +1916,7 @@ Item_in_subselect::row_value_in_to_exist
for (uint i= 0; i < cols_num; i++)
{
DBUG_ASSERT((left_expr->fixed &&
+
select_lex->ref_pointer_array[i]->fixed) ||
(select_lex->ref_pointer_array[i]->type() == REF_ITEM &&
((Item_ref*)(select_lex->ref_pointer_array[i]))->ref_type() ==
@@ -1932,8 +1950,8 @@ Item_in_subselect::row_value_in_to_exist
if (!(col_item= new Item_func_trig_cond(col_item, get_cond_guard(i))))
DBUG_RETURN(RES_ERROR);
}
- having_item= and_items(having_item, col_item);
-
+ *having_item= and_items(*having_item, col_item);
+
Item *item_nnull_test=
new Item_is_not_null_test(this,
new Item_ref(&select_lex->context,
@@ -1950,8 +1968,8 @@ Item_in_subselect::row_value_in_to_exist
item_having_part2= and_items(item_having_part2, item_nnull_test);
item_having_part2->top_level_item();
}
- having_item= and_items(having_item, item_having_part2);
- having_item->top_level_item();
+ *having_item= and_items(*having_item, item_having_part2);
+ (*having_item)->top_level_item();
}
else
{
@@ -1972,7 +1990,6 @@ Item_in_subselect::row_value_in_to_exist
(l2 = v2) and
(l3 = v3)
*/
- Item *where_item= 0;
for (uint i= 0; i < cols_num; i++)
{
Item *item, *item_isnull;
@@ -2030,10 +2047,33 @@ Item_in_subselect::row_value_in_to_exist
new Item_func_trig_cond(having_col_item, get_cond_guard(i))))
DBUG_RETURN(RES_ERROR);
}
- having_item= and_items(having_item, having_col_item);
+ *having_item= and_items(*having_item, having_col_item);
}
- where_item= and_items(where_item, item);
+ *where_item= and_items(*where_item, item);
}
+ (*where_item)->fix_fields(thd, 0);
+ }
+
+ DBUG_RETURN(RES_OK);
+}
+
+
+Item_subselect::trans_res
+Item_in_subselect::inject_row_value_in_to_exists_cond(JOIN * join,
+ Item *where_item,
+ Item *having_item)
+{
+ SELECT_LEX *select_lex= join->select_lex;
+ bool is_having_used= (join->having || select_lex->with_sum_func ||
+ select_lex->group_list.first ||
+ !select_lex->table_list.elements);
+
+ DBUG_ENTER("Item_in_subselect::inject_row_value_in_to_exists_cond");
+
+ select_lex->uncacheable|= UNCACHEABLE_DEPENDENT;
+
+ if (!is_having_used)
+ {
/*
AND can't be changed during fix_fields()
we can assign select_lex->where here, and pass 0 as last
@@ -2041,9 +2081,10 @@ Item_in_subselect::row_value_in_to_exist
*/
select_lex->where= join->conds= and_items(join->conds, where_item);
select_lex->where->top_level_item();
- if (join->conds->fix_fields(thd, 0))
+ if (!join->conds->fixed && join->conds->fix_fields(thd, 0))
DBUG_RETURN(RES_ERROR);
}
+
if (having_item)
{
bool res;
@@ -2057,12 +2098,11 @@ Item_in_subselect::row_value_in_to_exist
argument (reference) to fix_fields()
*/
select_lex->having_fix_field= 1;
- res= join->having->fix_fields(thd, 0);
+ if (!join->having->fixed)
+ res= join->having->fix_fields(thd, 0);
select_lex->having_fix_field= 0;
if (res)
- {
DBUG_RETURN(RES_ERROR);
- }
}
DBUG_RETURN(RES_OK);
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-07-18 11:46:08 +0000
+++ b/sql/item_subselect.h 2010-07-18 12:59:24 +0000
@@ -438,6 +438,13 @@ public:
Item *having_term);
trans_res row_value_in_to_exists_transformer(JOIN * join);
+ trans_res create_row_value_in_to_exists_cond(JOIN * join,
+ Item **where_term,
+ Item **having_term);
+ trans_res inject_row_value_in_to_exists_cond(JOIN * join,
+ Item *where_term,
+ Item *having_term);
+
virtual bool exec();
longlong val_int();
double val_real();
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3-mwl89/ branch (timour:2803)
by timour@askmonty.org 18 Jul '10
#At file:///home/tsk/mprog/src/5.3-mwl89/ based on revid:timour@askmonty.org-20100716121055-6pesx07gvsmivwm3
2803 timour(a)askmonty.org 2010-07-18
MWL#89: Cost-based choice between Materialization and IN->EXISTS transformation
Step 1 in the separation of the creation of IN->EXISTS equi-join conditions from
their injection. The goal of this separation is to make it possible for the
IN->EXISTS conditions to be used for cost estimation without actually modifying
the subquery.
This patch separates single_value_in_to_exists_transformer() into two methods:
- create_single_value_in_to_exists_cond(), and
- inject_single_value_in_to_exists_cond()
The patch performs minimal refactoring of the code so that it is easier to solve
problems resulting from the separation. There is a lot to be simplified in this
code, but this will be done separately.
modified:
sql/item_subselect.cc
sql/item_subselect.h
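For reference, the rewrite these two new functions build and inject, in
simplified form (the real conditions also carry NULL-handling wrappers such as
Item_func_trig_cond and Item_is_not_null_test):

  -- single-value IN predicate without grouping: the equality goes to WHERE
  SELECT * FROM t1 WHERE t1.a IN (SELECT t2.x FROM t2);
  SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.x = t1.a);

  -- with aggregation or GROUP BY the equality is attached as a HAVING condition
  SELECT * FROM t1 WHERE t1.a IN (SELECT MAX(t2.x) FROM t2 GROUP BY t2.y);
  SELECT * FROM t1 WHERE EXISTS (SELECT MAX(t2.x) FROM t2 GROUP BY t2.y
                                 HAVING MAX(t2.x) = t1.a);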
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-07-16 12:10:55 +0000
+++ b/sql/item_subselect.cc 2010-07-18 11:46:08 +0000
@@ -1521,16 +1521,35 @@ Item_in_subselect::single_value_transfor
*/
Item_subselect::trans_res
-Item_in_subselect::single_value_in_to_exists_transformer(JOIN * join, Comp_creator *func)
+Item_in_subselect::single_value_in_to_exists_transformer(JOIN * join,
+ Comp_creator *func)
+{
+ Item *where_term;
+ Item *having_term;
+ Item_subselect::trans_res res;
+
+ res= create_single_value_in_to_exists_cond(join, func,
+ &where_term, &having_term);
+ if (res != RES_OK)
+ return res;
+ res= inject_single_value_in_to_exists_cond(join, func,
+ where_term, having_term);
+ return res;
+}
+
+
+Item_subselect::trans_res
+Item_in_subselect::create_single_value_in_to_exists_cond(JOIN * join,
+ Comp_creator *func,
+ Item **where_term,
+ Item **having_term)
{
SELECT_LEX *select_lex= join->select_lex;
- DBUG_ENTER("Item_in_subselect::single_value_in_to_exists_transformer");
+ DBUG_ENTER("Item_in_subselect::create_single_value_in_to_exists_cond");
- select_lex->uncacheable|= UNCACHEABLE_DEPENDENT;
if (join->having || select_lex->with_sum_func ||
select_lex->group_list.elements)
{
- bool tmp;
Item *item= func->create(expr,
new Item_ref_null_helper(&select_lex->context,
this,
@@ -1546,132 +1565,199 @@ Item_in_subselect::single_value_in_to_ex
*/
item= new Item_func_trig_cond(item, get_cond_guard(0));
}
-
+
+ if (item->fix_fields(thd, 0))
+ DBUG_RETURN(RES_ERROR);
+
+ *having_term= item;
+ *where_term= NULL;
+ }
+ else
+ {
+ Item *item= (Item*) select_lex->item_list.head();
+
+ if (select_lex->table_list.elements)
+ {
+ Item *having= item;
+ Item *orig_item= item;
+
+ item= func->create(expr, item);
+ if (!abort_on_null && orig_item->maybe_null)
+ {
+ having= new Item_is_not_null_test(this, having);
+ if (left_expr->maybe_null)
+ {
+ if (!(having= new Item_func_trig_cond(having,
+ get_cond_guard(0))))
+ DBUG_RETURN(RES_ERROR);
+ }
+
+ if (having->fix_fields(thd, 0))
+ DBUG_RETURN(RES_ERROR);
+
+ *having_term= having;
+
+ item= new Item_cond_or(item,
+ new Item_func_isnull(orig_item));
+ }
+ /*
+ If we may encounter NULL IN (SELECT ...) and care whether subquery
+ result is NULL or FALSE, wrap condition in a trig_cond.
+ */
+ if (!abort_on_null && left_expr->maybe_null)
+ {
+ if (!(item= new Item_func_trig_cond(item, get_cond_guard(0))))
+ DBUG_RETURN(RES_ERROR);
+ }
+
+ if (item->fix_fields(thd, 0))
+ DBUG_RETURN(RES_ERROR);
+
+ *where_term= item;
+ }
+ else
+ {
+ if (select_lex->master_unit()->is_union())
+ {
+ /*
+ comparison functions can't be changed during fix_fields()
+ we can assign select_lex->having here, and pass 0 as last
+ argument (reference) to fix_fields()
+ */
+ Item *new_having=
+ func->create(expr,
+ new Item_ref_null_helper(&select_lex->context, this,
+ select_lex->ref_pointer_array,
+ (char *)"<no matter>",
+ (char *)"<result>"));
+ if (!abort_on_null && left_expr->maybe_null)
+ {
+ if (!(new_having= new Item_func_trig_cond(new_having,
+ get_cond_guard(0))))
+ DBUG_RETURN(RES_ERROR);
+ }
+
+ if (new_having->fix_fields(thd, 0))
+ DBUG_RETURN(RES_ERROR);
+
+ *having_term= new_having;
+ *where_term= NULL;
+ }
+ else
+ {
+ *having_term= NULL;
+ *where_term= (Item*) select_lex->item_list.head();
+ }
+ }
+ }
+
+ DBUG_RETURN(RES_OK);
+}
+
+
+
+Item_subselect::trans_res
+Item_in_subselect::inject_single_value_in_to_exists_cond(JOIN * join,
+ Comp_creator *func,
+ Item *where_term,
+ Item *having_term)
+{
+ SELECT_LEX *select_lex= join->select_lex;
+ bool fix_res;
+ DBUG_ENTER("Item_in_subselect::single_value_in_to_exists_transformer");
+
+ select_lex->uncacheable|= UNCACHEABLE_DEPENDENT;
+ if (join->having || select_lex->with_sum_func ||
+ select_lex->group_list.elements)
+ {
/*
AND and comparison functions can't be changed during fix_fields()
we can assign select_lex->having here, and pass 0 as last
argument (reference) to fix_fields()
*/
- select_lex->having= join->having= and_items(join->having, item);
- if (join->having == item)
- item->name= (char*)in_having_cond;
+ select_lex->having= join->having= and_items(join->having, having_term);
+ if (join->having == having_term)
+ having_term->name= (char*)in_having_cond;
select_lex->having_fix_field= 1;
/*
we do not check join->having->fixed, because Item_and (from and_items)
or comparison function (from func->create) can't be fixed after creation
*/
- tmp= join->having->fix_fields(thd, 0);
+ if (!join->having->fixed)
+ fix_res= join->having->fix_fields(thd, 0);
select_lex->having_fix_field= 0;
- if (tmp)
+ if (fix_res)
DBUG_RETURN(RES_ERROR);
}
else
{
- Item *item= (Item*) select_lex->item_list.head();
-
if (select_lex->table_list.elements)
{
- bool tmp;
- Item *having= item, *orig_item= item;
+ Item *orig_item= (Item*) select_lex->item_list.head();
select_lex->item_list.empty();
select_lex->item_list.push_back(new Item_int("Not_used",
(longlong) 1,
MY_INT64_NUM_DECIMAL_DIGITS));
select_lex->ref_pointer_array[0]= select_lex->item_list.head();
- item= func->create(expr, item);
if (!abort_on_null && orig_item->maybe_null)
{
- having= new Item_is_not_null_test(this, having);
- if (left_expr->maybe_null)
- {
- if (!(having= new Item_func_trig_cond(having,
- get_cond_guard(0))))
- DBUG_RETURN(RES_ERROR);
- }
/*
Item_is_not_null_test can't be changed during fix_fields()
we can assign select_lex->having here, and pass 0 as last
argument (reference) to fix_fields()
*/
- having->name= (char*)in_having_cond;
- select_lex->having= join->having= having;
+ having_term->name= (char*)in_having_cond;
+ select_lex->having= join->having= having_term;
select_lex->having_fix_field= 1;
/*
we do not check join->having->fixed, because Item_and (from
and_items) or comparison function (from func->create) can't be
fixed after creation
*/
- tmp= join->having->fix_fields(thd, 0);
+ if (!join->having->fixed)
+ fix_res= join->having->fix_fields(thd, 0);
select_lex->having_fix_field= 0;
- if (tmp)
+ if (fix_res)
DBUG_RETURN(RES_ERROR);
- item= new Item_cond_or(item,
- new Item_func_isnull(orig_item));
- }
- /*
- If we may encounter NULL IN (SELECT ...) and care whether subquery
- result is NULL or FALSE, wrap condition in a trig_cond.
- */
- if (!abort_on_null && left_expr->maybe_null)
- {
- if (!(item= new Item_func_trig_cond(item, get_cond_guard(0))))
- DBUG_RETURN(RES_ERROR);
}
/*
TODO: figure out why the following is done here in
single_value_transformer but there is no corresponding action in
row_value_transformer?
*/
- item->name= (char *)in_additional_cond;
+ where_term->name= (char *)in_additional_cond;
/*
AND can't be changed during fix_fields()
we can assign select_lex->having here, and pass 0 as last
argument (reference) to fix_fields()
*/
- select_lex->where= join->conds= and_items(join->conds, item);
+ select_lex->where= join->conds= and_items(join->conds, where_term);
select_lex->where->top_level_item();
/*
we do not check join->conds->fixed, because Item_and can't be fixed
after creation
*/
- if (join->conds->fix_fields(thd, 0))
- DBUG_RETURN(RES_ERROR);
+ if (!join->conds->fixed && join->conds->fix_fields(thd, 0))
+ DBUG_RETURN(RES_ERROR);
}
else
{
- bool tmp;
if (select_lex->master_unit()->is_union())
{
- /*
- comparison functions can't be changed during fix_fields()
- we can assign select_lex->having here, and pass 0 as last
- argument (reference) to fix_fields()
- */
- Item *new_having=
- func->create(expr,
- new Item_ref_null_helper(&select_lex->context, this,
- select_lex->ref_pointer_array,
- (char *)"<no matter>",
- (char *)"<result>"));
- if (!abort_on_null && left_expr->maybe_null)
- {
- if (!(new_having= new Item_func_trig_cond(new_having,
- get_cond_guard(0))))
- DBUG_RETURN(RES_ERROR);
- }
- new_having->name= (char*)in_having_cond;
- select_lex->having= join->having= new_having;
+ having_term->name= (char*)in_having_cond;
+ select_lex->having= join->having= having_term;
select_lex->having_fix_field= 1;
/*
we do not check join->having->fixed, because comparison function
(from func->create) can't be fixed after creation
*/
- tmp= join->having->fix_fields(thd, 0);
+ if (!join->having->fixed)
+ fix_res= join->having->fix_fields(thd, 0);
select_lex->having_fix_field= 0;
- if (tmp)
+ if (fix_res)
DBUG_RETURN(RES_ERROR);
}
else
@@ -1679,11 +1765,11 @@ Item_in_subselect::single_value_in_to_ex
// it is single select without tables => possible optimization
// remove the dependence mark since the item is moved to upper
// select and is not outer anymore.
- item->walk(&Item::remove_dependence_processor, 0,
- (uchar *) select_lex->outer_select());
- item= func->create(left_expr, item);
+ where_term->walk(&Item::remove_dependence_processor, 0,
+ (uchar *) select_lex->outer_select());
+ where_term= func->create(left_expr, where_term);
// fix_field of item will be done in time of substituting
- substitution= item;
+ substitution= where_term;
have_to_be_excluded= 1;
if (thd->lex->describe)
{
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-07-16 10:52:02 +0000
+++ b/sql/item_subselect.h 2010-07-18 11:46:08 +0000
@@ -425,8 +425,18 @@ public:
trans_res select_in_like_transformer(JOIN *join, Comp_creator *func);
trans_res single_value_transformer(JOIN *join, Comp_creator *func);
trans_res row_value_transformer(JOIN * join);
+
trans_res single_value_in_to_exists_transformer(JOIN * join,
Comp_creator *func);
+ trans_res create_single_value_in_to_exists_cond(JOIN * join,
+ Comp_creator *func,
+ Item **where_term,
+ Item **having_term);
+ trans_res inject_single_value_in_to_exists_cond(JOIN * join,
+ Comp_creator *func,
+ Item *where_term,
+ Item *having_term);
+
trans_res row_value_in_to_exists_transformer(JOIN * join);
virtual bool exec();
longlong val_int();
Re: [Maria-developers] [Fwd: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2 branch (igor:2823) Bug#604503]
by Sergey Petrunya 17 Jul '10
Hello Igor,
On Sat, Jul 17, 2010 at 12:42:49AM -0700, Igor Babaev wrote:
> === modified file 'sql/table.cc'
> --- a/sql/table.cc 2010-07-13 14:34:14 +0000
> +++ b/sql/table.cc 2010-07-17 07:37:48 +0000
> @@ -1930,8 +1930,6 @@ end:
> semantic analysis of the item by calling the the function
> fix_vcol_expr.
> Since the defining expression is part of the table definition the item
> for it is created in table->memroot within a separate Query_arena.
Please explicitly refer to TABLE::expr_arena in the above comment.
> - The free_list of this arena is saved in field->vcol_info.item_free_list
> - to be freed when the table defition is removed from the TABLE_SHARE
> cache.
>
> @note
> Before passing 'vcol_expr" to the parser the function embraces it in
...
> === modified file 'sql/table.h'
> --- a/sql/table.h 2010-06-03 09:28:54 +0000
> +++ b/sql/table.h 2010-07-17 07:37:48 +0000
> @@ -27,6 +27,7 @@ class st_select_lex;
> class partition_info;
> class COND_EQUAL;
> class Security_context;
> +class Query_arena;
>
> /*************************************************************************/
>
> @@ -869,6 +870,7 @@ struct st_table {
> MEM_ROOT mem_root;
> GRANT_INFO grant;
> FILESORT_INFO sort;
> + Query_arena *expr_arena;
Please add a comment saying what the new member is for.
> #ifdef WITH_PARTITION_STORAGE_ENGINE
> partition_info *part_info; /* Partition related information */
> bool no_partitions_used; /* If true, all partitions have been pruned
> away */
Ok to push after the above is addressed.
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
Re: [Maria-developers] [Fwd: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2 branch (igor:2823) Bug#603186]
by Sergey Petrunya 16 Jul '10
Hello Igor,
On Thu, Jul 15, 2010 at 04:54:37PM -0700, Igor Babaev wrote:
> Please review this patch for the 5.2 tree.
>
> Regards,
> Igor.
Ok to push.
> -------- Original Message --------
> Subject: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2
> branch (igor:2823) Bug#603186
> Date: Thu, 15 Jul 2010 16:51:17 -0700 (PDT)
> From: Igor Babaev <igor(a)askmonty.org>
> Reply-To: maria-developers(a)lists.launchpad.net
> To: commits(a)mariadb.org
>
> #At lp:maria/5.2 based on
> revid:igor@askmonty.org-20100713174523-mjvsvvp6ow8dc81x
>
> 2823 Igor Babaev 2010-07-15
> Fixed bug #603186.
> There were two problems that caused wrong results reported with
> this bug.
> 1. In some cases stored(persistent) virtual columns were not marked
> in the write_set and in the vcol_set bitmaps.
> 2. If the list of fields in an insert command was empty then the
> values of
> the stored virtual columns were set to default.
>
> To fix the first problem the function
> st_table::mark_virtual_columns_for_write
> was modified. Now the function has a parameter that says whether
> the virtual
> columns are to be marked for insert or for update.
> To fix the second problem a special handling of empty insert lists is
> added in the function fill_record().
> modified:
> mysql-test/suite/vcol/r/vcol_misc.result
> mysql-test/suite/vcol/t/vcol_misc.test
> sql/sql_base.cc
> sql/sql_insert.cc
> sql/sql_lex.cc
> sql/sql_lex.h
> sql/sql_table.cc
> sql/table.cc
> sql/table.h
>
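To illustrate the marking rules described above (my own example; the stored
column case is also what the new tests below verify):

  CREATE TABLE t1 (a INT, b INT DEFAULT 0, v INT AS (b+10) PERSISTENT);
  -- INSERT: a stored (persistent) virtual column is always marked for write,
  -- even though b is not in the field list, so v is computed and stored as 10
  INSERT INTO t1(a) VALUES (1);
  -- UPDATE: v depends only on b, so an update that touches only a does not
  -- require v to be marked and recomputed
  UPDATE t1 SET a = 2;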
> === modified file 'mysql-test/suite/vcol/r/vcol_misc.result'
> --- a/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-13 17:45:23 +0000
> +++ b/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-15 23:51:05 +0000
> @@ -45,3 +45,20 @@ C
> 1
> 1
> DROP TABLE t1;
> +CREATE TABLE t1(a int, b int DEFAULT 0, v INT AS (b+10) PERSISTENT);
> +INSERT INTO t1(a) VALUES (1);
> +SELECT b, v FROM t1;
> +b v
> +0 10
> +DROP TABLE t1;
> +CREATE TABLE t1(a int DEFAULT 100, v int AS (a+1) PERSISTENT);
> +INSERT INTO t1 () VALUES ();
> +CREATE TABLE t2(a int DEFAULT 100 , v int AS (a+1));
> +INSERT INTO t2 () VALUES ();
> +SELECT a, v FROM t1;
> +a v
> +100 101
> +SELECT a, v FROM t2;
> +a v
> +100 101
> +DROP TABLE t1,t2;
>
> === modified file 'mysql-test/suite/vcol/t/vcol_misc.test'
> --- a/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-13 17:45:23 +0000
> +++ b/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-15 23:51:05 +0000
> @@ -43,5 +43,22 @@ SELECT 1 AS C FROM t1 ORDER BY v;
>
> DROP TABLE t1;
>
> +#
> +# Bug#603186: Insert for a table with stored vurtual columns
> +#
>
> +CREATE TABLE t1(a int, b int DEFAULT 0, v INT AS (b+10) PERSISTENT);
> +INSERT INTO t1(a) VALUES (1);
> +SELECT b, v FROM t1;
>
> +DROP TABLE t1;
> +
> +CREATE TABLE t1(a int DEFAULT 100, v int AS (a+1) PERSISTENT);
> +INSERT INTO t1 () VALUES ();
> +CREATE TABLE t2(a int DEFAULT 100 , v int AS (a+1));
> +INSERT INTO t2 () VALUES ();
> +
> +SELECT a, v FROM t1;
> +SELECT a, v FROM t2;
> +
> +DROP TABLE t1,t2;
>
> === modified file 'sql/sql_base.cc'
> --- a/sql/sql_base.cc 2010-06-01 19:52:20 +0000
> +++ b/sql/sql_base.cc 2010-07-15 23:51:05 +0000
> @@ -8204,6 +8204,8 @@ fill_record(THD * thd, List<Item> &field
> table->auto_increment_field_not_null= FALSE;
> f.rewind();
> }
> + else if (thd->lex->unit.insert_table_with_stored_vcol)
> + tbl_list.push_back(thd->lex->unit.insert_table_with_stored_vcol);
> while ((fld= f++))
> {
> if (!(field= fld->filed_for_view_update()))
>
> === modified file 'sql/sql_insert.cc'
> --- a/sql/sql_insert.cc 2010-06-01 19:52:20 +0000
> +++ b/sql/sql_insert.cc 2010-07-15 23:51:05 +0000
> @@ -273,7 +273,7 @@ static int check_insert_fields(THD *thd,
> }
> /* Mark virtual columns used in the insert statement */
> if (table->vfield)
> - table->mark_virtual_columns_for_write();
> + table->mark_virtual_columns_for_write(TRUE);
> // For the values we need select_priv
> #ifndef NO_EMBEDDED_ACCESS_CHECKS
> table->grant.want_privilege= (SELECT_ACL & ~table->grant.privilege);
> @@ -1267,7 +1267,6 @@ bool mysql_prepare_insert(THD *thd, TABL
> if (mysql_prepare_insert_check_table(thd, table_list, fields,
> select_insert))
> DBUG_RETURN(TRUE);
>
> -
> /* Prepare the fields in the statement. */
> if (values)
> {
> @@ -1320,6 +1319,18 @@ bool mysql_prepare_insert(THD *thd, TABL
> if (!table)
> table= table_list->table;
>
> + if (!fields.elements && table->vfield)
> + {
> + for (Field **vfield_ptr= table->vfield; *vfield_ptr; vfield_ptr++)
> + {
> + if ((*vfield_ptr)->stored_in_db)
> + {
> + thd->lex->unit.insert_table_with_stored_vcol= table;
> + break;
> + }
> + }
> + }
> +
> if (!select_insert)
> {
> Item *fake_conds= 0;
>
> === modified file 'sql/sql_lex.cc'
> --- a/sql/sql_lex.cc 2010-06-01 19:52:20 +0000
> +++ b/sql/sql_lex.cc 2010-07-15 23:51:05 +0000
> @@ -1590,6 +1590,7 @@ void st_select_lex_unit::init_query()
> item_list.empty();
> describe= 0;
> found_rows_for_union= 0;
> + insert_table_with_stored_vcol= 0;
> }
>
> void st_select_lex::init_query()
>
> === modified file 'sql/sql_lex.h'
> --- a/sql/sql_lex.h 2010-06-01 19:52:20 +0000
> +++ b/sql/sql_lex.h 2010-07-15 23:51:05 +0000
> @@ -532,6 +532,13 @@ public:
> bool describe; /* union exec() called for EXPLAIN */
> Procedure *last_procedure; /* Pointer to procedure, if such exists */
>
> + /*
> + Insert table with stored virtual columns.
> + This is used only in those rare cases
> + when the list of inserted values is empty.
> + */
> + TABLE *insert_table_with_stored_vcol;
> +
> void init_query();
> st_select_lex_unit* master_unit();
> st_select_lex* outer_select();
>
> === modified file 'sql/sql_table.cc'
> --- a/sql/sql_table.cc 2010-06-05 14:53:36 +0000
> +++ b/sql/sql_table.cc 2010-07-15 23:51:05 +0000
> @@ -7876,7 +7876,7 @@ copy_data_between_tables(TABLE *from,TAB
>
> /* Tell handler that we have values for all columns in the to table */
> to->use_all_columns();
> - to->mark_virtual_columns_for_write();
> + to->mark_virtual_columns_for_write(TRUE);
> init_read_record(&info, thd, from, (SQL_SELECT *) 0, 1, 1, FALSE);
> errpos= 4;
> if (ignore)
>
> === modified file 'sql/table.cc'
> --- a/sql/table.cc 2010-07-13 14:34:14 +0000
> +++ b/sql/table.cc 2010-07-15 23:51:05 +0000
> @@ -5024,7 +5024,7 @@ void st_table::mark_columns_needed_for_u
> }
> }
> /* Mark all virtual columns needed for update */
> - mark_virtual_columns_for_write();
> + mark_virtual_columns_for_write(FALSE);
> DBUG_VOID_RETURN;
> }
>
> @@ -5052,7 +5052,7 @@ void st_table::mark_columns_needed_for_i
> if (found_next_number_field)
> mark_auto_increment_column();
> /* Mark virtual columns for insert */
> - mark_virtual_columns_for_write();
> + mark_virtual_columns_for_write(TRUE);
> }
>
>
> @@ -5090,10 +5090,14 @@ bool st_table::mark_virtual_col(Field *f
>
> /*
> @brief Mark virtual columns for update/insert commands
> +
> + @param insert_fl <-> virtual columns are marked for insert command
>
> @details
> The function marks virtual columns used in a update/insert commands
> in the vcol_set bitmap.
> + For an insert command a virtual column is always marked in write_set if
> + it is a stored column.
> If a virtual column is from write_set it is always marked in vcol_set.
> If a stored virtual column is not from write_set but it is computed
> through columns from write_set it is also marked in vcol_set, and,
> @@ -5112,7 +5116,7 @@ bool st_table::mark_virtual_col(Field *f
> be added to read_set either.
> */
>
> -void st_table::mark_virtual_columns_for_write(void)
> +void st_table::mark_virtual_columns_for_write(bool insert_fl)
> {
> Field **vfield_ptr, *tmp_vfield;
> bool bitmap_updated= FALSE;
> @@ -5124,16 +5128,21 @@ void st_table::mark_virtual_columns_for_
> bitmap_updated= mark_virtual_col(tmp_vfield);
> else if (tmp_vfield->stored_in_db)
> {
> - MY_BITMAP *save_read_set;
> - Item *vcol_item= tmp_vfield->vcol_info->expr_item;
> - DBUG_ASSERT(vcol_item);
> - bitmap_clear_all(&tmp_set);
> - save_read_set= read_set;
> - read_set= &tmp_set;
> - vcol_item->walk(&Item::register_field_in_read_map, 1, (uchar *) 0);
> - read_set= save_read_set;
> - bitmap_intersect(&tmp_set, write_set);
> - if (!bitmap_is_clear_all(&tmp_set))
> + bool mark_fl= insert_fl;
> + if (!mark_fl)
> + {
> + MY_BITMAP *save_read_set;
> + Item *vcol_item= tmp_vfield->vcol_info->expr_item;
> + DBUG_ASSERT(vcol_item);
> + bitmap_clear_all(&tmp_set);
> + save_read_set= read_set;
> + read_set= &tmp_set;
> + vcol_item->walk(&Item::register_field_in_read_map, 1, (uchar *) 0);
> + read_set= save_read_set;
> + bitmap_intersect(&tmp_set, write_set);
> + mark_fl= !bitmap_is_clear_all(&tmp_set);
> + }
> + if (mark_fl)
> {
> bitmap_set_bit(write_set, tmp_vfield->field_index);
> mark_virtual_col(tmp_vfield);
>
> === modified file 'sql/table.h'
> --- a/sql/table.h 2010-06-03 09:28:54 +0000
> +++ b/sql/table.h 2010-07-15 23:51:05 +0000
> @@ -886,7 +886,7 @@ struct st_table {
> void mark_columns_needed_for_delete(void);
> void mark_columns_needed_for_insert(void);
> bool mark_virtual_col(Field *field);
> - void mark_virtual_columns_for_write(void);
> + void mark_virtual_columns_for_write(bool insert_fl);
> inline void column_bitmaps_set(MY_BITMAP *read_set_arg,
> MY_BITMAP *write_set_arg)
> {
>
> _______________________________________________
> commits mailing list
> commits(a)mariadb.org
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
--
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3-mwl89/ branch (timour:2802)
by timour@askmonty.org 16 Jul '10
#At file:///home/tsk/mprog/src/5.3-mwl89/ based on revid:timour@askmonty.org-20100716105202-8narq4tzhka2n1a5
2802 timour(a)askmonty.org 2010-07-16 [merge]
Merge main 5.3 into 5.3-mwl89.
added:
mysql-test/r/optimizer_switch.result
mysql-test/t/optimizer_switch.test
modified:
.bzrignore
configure.in
include/queues.h
include/thr_alarm.h
mysql-test/r/index_merge_myisam.result
mysql-test/r/myisam_mrr.result
mysql-test/r/order_by.result
mysql-test/r/subselect_mat.result
mysql-test/r/subselect_no_mat.result
mysql-test/r/subselect_no_opts.result
mysql-test/r/subselect_no_semijoin.result
mysql-test/r/subselect_sj.result
mysql-test/r/subselect_sj_jcl6.result
mysql-test/t/index_merge_myisam.test
mysql-test/t/myisam_mrr.test
mysql-test/t/order_by.test
mysql-test/t/subselect_mat.test
mysql-test/t/subselect_no_mat.test
mysql-test/t/subselect_no_opts.test
mysql-test/t/subselect_no_semijoin.test
mysql-test/t/subselect_sj.test
mysys/queues.c
mysys/thr_alarm.c
sql/create_options.cc
sql/event_queue.cc
sql/filesort.cc
sql/ha_partition.cc
sql/ha_partition.h
sql/item_cmpfunc.cc
sql/item_subselect.cc
sql/mysqld.cc
sql/net_serv.cc
sql/opt_range.cc
sql/sql_class.cc
sql/sql_class.h
sql/sql_union.cc
sql/uniques.cc
storage/maria/ma_ft_boolean_search.c
storage/maria/ma_ft_nlq_search.c
storage/maria/ma_sort.c
storage/maria/maria_pack.c
storage/myisam/ft_boolean_search.c
storage/myisam/ft_nlq_search.c
storage/myisam/mi_test_all.sh
storage/myisam/myisampack.c
storage/myisam/sort.c
storage/myisammrg/myrg_queue.c
storage/myisammrg/myrg_rnext.c
storage/myisammrg/myrg_rnext_same.c
storage/myisammrg/myrg_rprev.c
=== modified file '.bzrignore'
--- a/.bzrignore 2010-06-26 10:05:41 +0000
+++ b/.bzrignore 2010-07-16 07:33:01 +0000
@@ -1940,3 +1940,4 @@ sql/client_plugin.c
*.dgcov
libmysqld/create_options.cc
storage/pbxt/bin/xtstat
+libmysqld/sql_expression_cache.cc
=== modified file 'configure.in'
--- a/configure.in 2010-06-26 19:55:33 +0000
+++ b/configure.in 2010-07-16 08:02:05 +0000
@@ -17,7 +17,7 @@ dnl When merging new MySQL releases, upd
dnl MySQL version number.
dnl
dnl Note: the following line must be parseable by win/configure.js:GetVersion()
-AC_INIT([MariaDB Server], [5.2.1-MariaDB-beta], [], [mysql])
+AC_INIT([MariaDB Server], [5.3.0-MariaDB-alpha], [], [mysql])
AC_CONFIG_SRCDIR([sql/mysqld.cc])
AC_CANONICAL_SYSTEM
=== modified file 'include/queues.h'
--- a/include/queues.h 2007-11-14 18:20:31 +0000
+++ b/include/queues.h 2010-07-16 07:33:01 +0000
@@ -1,23 +1,31 @@
-/* Copyright (C) 2000 MySQL AB
+/* Copyright (C) 2010 Monty Program Ab
+ All Rights reserved
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; version 2 of the License.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the following disclaimer
+ in the documentation and/or other materials provided with the
+ distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ <COPYRIGHT HOLDER> BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGE.
+*/
/*
Code for generell handling of priority Queues.
Implemention of queues from "Algoritms in C" by Robert Sedgewick.
- Copyright Monty Program KB.
- By monty.
*/
#ifndef _queues_h
@@ -31,30 +39,34 @@ typedef struct st_queue {
void *first_cmp_arg;
uint elements;
uint max_elements;
- uint offset_to_key; /* compare is done on element+offset */
+ uint offset_to_key; /* compare is done on element+offset */
+ uint offset_to_queue_pos; /* If we want to store position in element */
+ uint auto_extent;
int max_at_top; /* Normally 1, set to -1 if queue_top gives max */
int (*compare)(void *, uchar *,uchar *);
- uint auto_extent;
} QUEUE;
+#define queue_first_element(queue) 1
+#define queue_last_element(queue) (queue)->elements
#define queue_top(queue) ((queue)->root[1])
-#define queue_element(queue,index) ((queue)->root[index+1])
+#define queue_element(queue,index) ((queue)->root[index])
#define queue_end(queue) ((queue)->root[(queue)->elements])
-#define queue_replaced(queue) _downheap(queue,1)
+#define queue_replace(queue, idx) _downheap(queue, idx, (queue)->root[idx])
+#define queue_replace_top(queue) _downheap(queue, 1, (queue)->root[1])
#define queue_set_cmp_arg(queue, set_arg) (queue)->first_cmp_arg= set_arg
#define queue_set_max_at_top(queue, set_arg) \
(queue)->max_at_top= set_arg ? -1 : 1
+#define queue_remove_top(queue_arg) queue_remove((queue_arg), queue_first_element(queue_arg))
typedef int (*queue_compare)(void *,uchar *, uchar *);
int init_queue(QUEUE *queue,uint max_elements,uint offset_to_key,
pbool max_at_top, queue_compare compare,
- void *first_cmp_arg);
-int init_queue_ex(QUEUE *queue,uint max_elements,uint offset_to_key,
- pbool max_at_top, queue_compare compare,
- void *first_cmp_arg, uint auto_extent);
+ void *first_cmp_arg, uint offset_to_queue_pos,
+ uint auto_extent);
int reinit_queue(QUEUE *queue,uint max_elements,uint offset_to_key,
pbool max_at_top, queue_compare compare,
- void *first_cmp_arg);
+ void *first_cmp_arg, uint offset_to_queue_pos,
+ uint auto_extent);
int resize_queue(QUEUE *queue, uint max_elements);
void delete_queue(QUEUE *queue);
void queue_insert(QUEUE *queue,uchar *element);
@@ -62,7 +74,7 @@ int queue_insert_safe(QUEUE *queue, ucha
uchar *queue_remove(QUEUE *queue,uint idx);
#define queue_remove_all(queue) { (queue)->elements= 0; }
#define queue_is_full(queue) (queue->elements == queue->max_elements)
-void _downheap(QUEUE *queue,uint idx);
+void _downheap(QUEUE *queue, uint idx, uchar *element);
void queue_fix(QUEUE *queue);
#define is_queue_inited(queue) ((queue)->root != 0)
=== modified file 'include/thr_alarm.h'
--- a/include/thr_alarm.h 2008-04-28 16:24:05 +0000
+++ b/include/thr_alarm.h 2010-07-16 07:33:01 +0000
@@ -34,7 +34,7 @@ extern "C" {
typedef struct st_alarm_info
{
- ulong next_alarm_time;
+ time_t next_alarm_time;
uint active_alarms;
uint max_used_alarms;
} ALARM_INFO;
@@ -78,10 +78,11 @@ typedef int thr_alarm_entry;
typedef thr_alarm_entry* thr_alarm_t;
typedef struct st_alarm {
- ulong expire_time;
+ time_t expire_time;
thr_alarm_entry alarmed; /* set when alarm is due */
pthread_t thread;
my_thread_id thread_id;
+ uint index_in_queue;
my_bool malloced;
} ALARM;
=== modified file 'mysql-test/r/index_merge_myisam.result'
--- a/mysql-test/r/index_merge_myisam.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/index_merge_myisam.result 2010-07-16 08:58:24 +0000
@@ -1413,66 +1413,6 @@ WHERE
`RUNID`= '' AND `SUBMITNR`= '' AND `ORDERNR`='' AND `PROGRAMM`='' AND
`TESTID`='' AND `UCCHECK`='';
drop table t1;
-#
-# Generic @@optimizer_switch tests (move those into a separate file if
-# we get another @@optimizer_switch user)
-#
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='index_merge=off,index_merge_union=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='index_merge_union=on';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,index_merge_sort_union=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch=4;
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of '4'
-set optimizer_switch=NULL;
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'NULL'
-set optimizer_switch='default,index_merge';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge'
-set optimizer_switch='index_merge=index_merge';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge=index_merge'
-set optimizer_switch='index_merge=on,but...';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'but...'
-set optimizer_switch='index_merge=';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge='
-set optimizer_switch='index_merge';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge'
-set optimizer_switch='on';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'on'
-set optimizer_switch='index_merge=on,index_merge=off';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge=off'
-set optimizer_switch='index_merge_union=on,index_merge_union=default';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge_union=default'
-set optimizer_switch='default,index_merge=on,index_merge=off,default';
-ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge=off,default'
-set optimizer_switch=default;
-set optimizer_switch='index_merge=off,index_merge_union=off,default';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch=default;
-select @@global.optimizer_switch;
-@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set @@global.optimizer_switch=default;
-select @@global.optimizer_switch;
-@@global.optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-#
-# Check index_merge's @@optimizer_switch flags
-#
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, c int, filler char(100),
@@ -1580,7 +1520,4 @@ explain select * from t1 where a=10 and
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 index_merge a,b,c a,c 5,5 NULL 54 Using sort_union(a,c); Using where
set optimizer_switch=default;
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
drop table t0, t1;
=== modified file 'mysql-test/r/myisam_mrr.result'
--- a/mysql-test/r/myisam_mrr.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/myisam_mrr.result 2010-07-16 08:58:24 +0000
@@ -392,9 +392,9 @@ drop table t0, t1;
# Part of MWL#67: DS-MRR backport: add an @@optimizer_switch flag for
# index_condition pushdown:
# - engine_condition_pushdown does not affect ICP
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+select @@optimizer_switch like '%index_condition_pushdown=on%';
+@@optimizer_switch like '%index_condition_pushdown=on%'
+1
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, key(a));
=== added file 'mysql-test/r/optimizer_switch.result'
--- a/mysql-test/r/optimizer_switch.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/r/optimizer_switch.result 2010-07-16 08:58:24 +0000
@@ -0,0 +1,99 @@
+#
+# Generic @@optimizer_switch tests
+#
+#
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='index_merge=off,index_merge_union=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='index_merge_union=on';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=off,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,index_merge_sort_union=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=off,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch=4;
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of '4'
+set optimizer_switch=NULL;
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'NULL'
+set optimizer_switch='default,index_merge';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge'
+set optimizer_switch='index_merge=index_merge';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge=index_merge'
+set optimizer_switch='index_merge=on,but...';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'but...'
+set optimizer_switch='index_merge=';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge='
+set optimizer_switch='index_merge';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge'
+set optimizer_switch='on';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'on'
+set optimizer_switch='index_merge=on,index_merge=off';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge=off'
+set optimizer_switch='index_merge_union=on,index_merge_union=default';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge_union=default'
+set optimizer_switch='default,index_merge=on,index_merge=off,default';
+ERROR 42000: Variable 'optimizer_switch' can't be set to the value of 'index_merge=off,default'
+set optimizer_switch=default;
+set optimizer_switch='index_merge=off,index_merge_union=off,default';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=off,index_merge_union=off,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch=default;
+select @@global.optimizer_switch;
+@@global.optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set @@global.optimizer_switch=default;
+select @@global.optimizer_switch;
+@@global.optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+#
+# Check index_merge's @@optimizer_switch flags
+#
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+
+BUG#37120 optimizer_switch allowable values not according to specification
+
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,materialization=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,semijoin=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,loosescan=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,semijoin=off,materialization=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,materialization=off,semijoin=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,semijoin=off,loosescan=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch='default,materialization=off,loosescan=off';
+select @@optimizer_switch;
+@@optimizer_switch
+index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+set optimizer_switch=default;
=== modified file 'mysql-test/r/order_by.result'
--- a/mysql-test/r/order_by.result 2010-03-20 12:01:47 +0000
+++ b/mysql-test/r/order_by.result 2010-07-15 14:07:01 +0000
@@ -607,9 +607,14 @@ FieldKey LongVal StringVal
1 0 2
1 1 3
1 2 1
-EXPLAIN SELECT * FROM t1 WHERE FieldKey > '2' ORDER BY LongVal;
+DS-MRR: use two IGNORE INDEX queries, otherwise we get cost races, because
+DS-MRR: records_in_range/read_time return the same numbers for all three indexes
+EXPLAIN SELECT * FROM t1 IGNORE INDEX (LongField, StringField) WHERE FieldKey > '2' ORDER BY LongVal;
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE t1 range FieldKey,LongField,StringField FieldKey 38 NULL 4 Using index condition; Using where; Using MRR; Using filesort
+1 SIMPLE t1 range FieldKey FieldKey 38 NULL 4 Using index condition; Using MRR; Using filesort
+EXPLAIN SELECT * FROM t1 IGNORE INDEX (FieldKey, LongField) WHERE FieldKey > '2' ORDER BY LongVal;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 range StringField StringField 38 NULL 4 Using where; Using filesort
SELECT * FROM t1 WHERE FieldKey > '2' ORDER BY LongVal;
FieldKey LongVal StringVal
3 1 2
=== modified file 'mysql-test/r/subselect_mat.result'
--- a/mysql-test/r/subselect_mat.result 2010-07-16 10:52:02 +0000
+++ b/mysql-test/r/subselect_mat.result 2010-07-16 12:10:55 +0000
@@ -1246,3 +1246,29 @@ i
4
set session optimizer_switch=@save_optimizer_switch;
drop table t1, t2, t3;
+create table t0 (a int);
+insert into t0 values (0),(1),(2);
+create table t1 (a int);
+insert into t1 values (0),(1),(2);
+explain select a, a in (select a from t1) from t0;
+id select_type table type possible_keys key key_len ref rows Extra
+1 PRIMARY t0 ALL NULL NULL NULL NULL 3
+2 SUBQUERY t1 ALL NULL NULL NULL NULL 3
+select a, a in (select a from t1) from t0;
+a a in (select a from t1)
+0 1
+1 1
+2 1
+prepare s from 'select a, a in (select a from t1) from t0';
+execute s;
+a a in (select a from t1)
+0 1
+1 1
+2 1
+update t1 set a=123;
+execute s;
+a a in (select a from t1)
+0 0
+1 0
+2 0
+drop table t0, t1;
=== modified file 'mysql-test/r/subselect_no_mat.result'
--- a/mysql-test/r/subselect_no_mat.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subselect_no_mat.result 2010-07-16 08:58:24 +0000
@@ -1,6 +1,6 @@
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+select @@optimizer_switch like '%materialization=on%';
+@@optimizer_switch like '%materialization=on%'
+1
set optimizer_switch='materialization=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4925,6 +4925,6 @@ DROP TABLE t3;
DROP TABLE t2;
DROP TABLE t1;
set optimizer_switch=default;
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
+select @@optimizer_switch like '%materialization=on%';
+@@optimizer_switch like '%materialization=on%'
+1
=== modified file 'mysql-test/r/subselect_no_opts.result'
--- a/mysql-test/r/subselect_no_opts.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subselect_no_opts.result 2010-07-16 08:58:24 +0000
@@ -1,6 +1,3 @@
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='materialization=off,semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4925,6 +4922,3 @@ DROP TABLE t3;
DROP TABLE t2;
DROP TABLE t1;
set optimizer_switch=default;
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_no_semijoin.result'
--- a/mysql-test/r/subselect_no_semijoin.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subselect_no_semijoin.result 2010-07-16 08:58:24 +0000
@@ -1,6 +1,3 @@
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
set optimizer_switch='semijoin=off';
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t11,t12;
set @save_optimizer_switch=@@optimizer_switch;
@@ -4925,6 +4922,3 @@ DROP TABLE t3;
DROP TABLE t2;
DROP TABLE t1;
set optimizer_switch=default;
-show variables like 'optimizer_switch';
-Variable_name Value
-optimizer_switch index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subselect_sj.result 2010-07-16 08:58:24 +0000
@@ -197,45 +197,6 @@ id select_type table type possible_keys
1 PRIMARY t1 ALL NULL NULL NULL NULL 103 100.00 Using where; Using join buffer
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t10` join `test`.`t1` where ((`test`.`t1`.`a` = `test`.`t10`.`pk`) and (`test`.`t10`.`pk` < 3))
-
-BUG#37120 optimizer_switch allowable values not according to specification
-
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,materialization=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off,materialization=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,materialization=off,semijoin=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,materialization=off,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2010-07-10 10:37:30 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2010-07-16 08:58:24 +0000
@@ -201,45 +201,6 @@ id select_type table type possible_keys
1 PRIMARY t1 ALL NULL NULL NULL NULL 103 100.00 Using where; Using join buffer
Warnings:
Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t10` join `test`.`t1` where ((`test`.`t1`.`a` = `test`.`t10`.`pk`) and (`test`.`t10`.`pk` < 3))
-
-BUG#37120 optimizer_switch allowable values not according to specification
-
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,materialization=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off,materialization=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,materialization=off,semijoin=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=on,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,semijoin=off,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=on,semijoin=off,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch='default,materialization=off,loosescan=off';
-select @@optimizer_switch;
-@@optimizer_switch
-index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_condition_pushdown=on,firstmatch=on,loosescan=off,materialization=off,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on
-set optimizer_switch=default;
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysql-test/t/index_merge_myisam.test'
--- a/mysql-test/t/index_merge_myisam.test 2009-08-24 19:10:48 +0000
+++ b/mysql-test/t/index_merge_myisam.test 2010-07-16 08:58:24 +0000
@@ -20,78 +20,6 @@ let $merge_table_support= 1;
--source include/index_merge_2sweeps.inc
--source include/index_merge_ror_cpk.inc
---echo #
---echo # Generic @@optimizer_switch tests (move those into a separate file if
---echo # we get another @@optimizer_switch user)
---echo #
-
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='index_merge=off,index_merge_union=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='index_merge_union=on';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,index_merge_sort_union=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch=4;
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch=NULL;
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='default,index_merge';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='index_merge=index_merge';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='index_merge=on,but...';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='index_merge=';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='index_merge';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='on';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='index_merge=on,index_merge=off';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='index_merge_union=on,index_merge_union=default';
-
---error ER_WRONG_VALUE_FOR_VAR
-set optimizer_switch='default,index_merge=on,index_merge=off,default';
-
-set optimizer_switch=default;
-set optimizer_switch='index_merge=off,index_merge_union=off,default';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-set optimizer_switch=default;
-
-# Check setting defaults for global vars
---replace_regex /,table_elimination=on//
-select @@global.optimizer_switch;
-set @@global.optimizer_switch=default;
---replace_regex /,table_elimination=on//
-select @@global.optimizer_switch;
-
---echo #
---echo # Check index_merge's @@optimizer_switch flags
---echo #
---replace_regex /,table_elimination.on//
-select @@optimizer_switch;
-
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1 (a int, b int, c int, filler char(100),
@@ -190,8 +118,6 @@ set optimizer_switch='default,index_merg
explain select * from t1 where a=10 and b=10 or c=10;
set optimizer_switch=default;
---replace_regex /,table_elimination.on//
-show variables like 'optimizer_switch';
drop table t0, t1;
=== modified file 'mysql-test/t/myisam_mrr.test'
--- a/mysql-test/t/myisam_mrr.test 2009-12-22 14:43:00 +0000
+++ b/mysql-test/t/myisam_mrr.test 2010-07-16 08:58:24 +0000
@@ -103,8 +103,7 @@ drop table t0, t1;
# Check that optimizer_switch is present
---replace_regex /,table_elimination=o[nf]*//
-select @@optimizer_switch;
+select @@optimizer_switch like '%index_condition_pushdown=on%';
# Check if it affects ICP
create table t0 (a int);
=== added file 'mysql-test/t/optimizer_switch.test'
--- a/mysql-test/t/optimizer_switch.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/t/optimizer_switch.test 2010-07-16 08:58:24 +0000
@@ -0,0 +1,113 @@
+--echo #
+--echo # Generic @@optimizer_switch tests
+--echo #
+--echo #
+
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='index_merge=off,index_merge_union=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='index_merge_union=on';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,index_merge_sort_union=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch=4;
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch=NULL;
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='default,index_merge';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='index_merge=index_merge';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='index_merge=on,but...';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='index_merge=';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='index_merge';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='on';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='index_merge=on,index_merge=off';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='index_merge_union=on,index_merge_union=default';
+
+--error ER_WRONG_VALUE_FOR_VAR
+set optimizer_switch='default,index_merge=on,index_merge=off,default';
+
+set optimizer_switch=default;
+set optimizer_switch='index_merge=off,index_merge_union=off,default';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+set optimizer_switch=default;
+
+# Check setting defaults for global vars
+--replace_regex /,table_elimination=on//
+select @@global.optimizer_switch;
+set @@global.optimizer_switch=default;
+--replace_regex /,table_elimination=on//
+select @@global.optimizer_switch;
+
+--echo #
+--echo # Check index_merge's @@optimizer_switch flags
+--echo #
+--replace_regex /,table_elimination.on//
+select @@optimizer_switch;
+
+--echo
+--echo BUG#37120 optimizer_switch allowable values not according to specification
+--echo
+
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,materialization=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,semijoin=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,loosescan=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,semijoin=off,materialization=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,materialization=off,semijoin=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,semijoin=off,loosescan=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+
+set optimizer_switch='default,materialization=off,loosescan=off';
+--replace_regex /,table_elimination=on//
+select @@optimizer_switch;
+set optimizer_switch=default;
+
+
=== modified file 'mysql-test/t/order_by.test'
--- a/mysql-test/t/order_by.test 2010-03-04 08:03:07 +0000
+++ b/mysql-test/t/order_by.test 2010-07-15 14:07:01 +0000
@@ -402,7 +402,11 @@ CREATE TABLE t1 (
INSERT INTO t1 VALUES ('0',3,'0'),('0',2,'1'),('0',1,'2'),('1',2,'1'),('1',1,'3'), ('1',0,'2'),('2',3,'0'),('2',2,'1'),('2',1,'2'),('2',3,'0'),('2',2,'1'),('2',1,'2'),('3',2,'1'),('3',1,'2'),('3','3','3');
EXPLAIN SELECT * FROM t1 WHERE FieldKey = '1' ORDER BY LongVal;
SELECT * FROM t1 WHERE FieldKey = '1' ORDER BY LongVal;
-EXPLAIN SELECT * FROM t1 WHERE FieldKey > '2' ORDER BY LongVal;
+--echo DS-MRR: use two IGNORE INDEX queries, otherwise we get cost races, because
+--echo DS-MRR: records_in_range/read_time return the same numbers for all three indexes
+EXPLAIN SELECT * FROM t1 IGNORE INDEX (LongField, StringField) WHERE FieldKey > '2' ORDER BY LongVal;
+EXPLAIN SELECT * FROM t1 IGNORE INDEX (FieldKey, LongField) WHERE FieldKey > '2' ORDER BY LongVal;
+
SELECT * FROM t1 WHERE FieldKey > '2' ORDER BY LongVal;
EXPLAIN SELECT * FROM t1 WHERE FieldKey > '2' ORDER BY FieldKey, LongVal;
SELECT * FROM t1 WHERE FieldKey > '2' ORDER BY FieldKey, LongVal;
=== modified file 'mysql-test/t/subselect_mat.test'
--- a/mysql-test/t/subselect_mat.test 2010-03-13 20:04:52 +0000
+++ b/mysql-test/t/subselect_mat.test 2010-07-16 11:02:15 +0000
@@ -905,3 +905,19 @@ select * from t1 where t1.i in (select t
set session optimizer_switch=@save_optimizer_switch;
drop table t1, t2, t3;
+#
+# Test that the contents of the temp table of a materialized subquery is
+# cleaned up between PS re-executions.
+#
+
+create table t0 (a int);
+insert into t0 values (0),(1),(2);
+create table t1 (a int);
+insert into t1 values (0),(1),(2);
+explain select a, a in (select a from t1) from t0;
+select a, a in (select a from t1) from t0;
+prepare s from 'select a, a in (select a from t1) from t0';
+execute s;
+update t1 set a=123;
+execute s;
+drop table t0, t1;
=== modified file 'mysql-test/t/subselect_no_mat.test'
--- a/mysql-test/t/subselect_no_mat.test 2010-02-21 07:33:54 +0000
+++ b/mysql-test/t/subselect_no_mat.test 2010-07-16 08:58:24 +0000
@@ -1,13 +1,11 @@
#
# Run subselect.test without semi-join optimization (test materialize)
#
---replace_regex /,table_elimination=on//
-show variables like 'optimizer_switch';
+select @@optimizer_switch like '%materialization=on%';
set optimizer_switch='materialization=off';
--source t/subselect.test
set optimizer_switch=default;
---replace_regex /,table_elimination=on//
-show variables like 'optimizer_switch';
+select @@optimizer_switch like '%materialization=on%';
=== modified file 'mysql-test/t/subselect_no_opts.test'
--- a/mysql-test/t/subselect_no_opts.test 2010-02-21 07:33:54 +0000
+++ b/mysql-test/t/subselect_no_opts.test 2010-07-16 08:58:24 +0000
@@ -1,13 +1,9 @@
#
# Run subselect.test without semi-join optimization (test materialize)
#
---replace_regex /,table_elimination=on//
-show variables like 'optimizer_switch';
set optimizer_switch='materialization=off,semijoin=off';
--source t/subselect.test
set optimizer_switch=default;
---replace_regex /,table_elimination=on//
-show variables like 'optimizer_switch';
=== modified file 'mysql-test/t/subselect_no_semijoin.test'
--- a/mysql-test/t/subselect_no_semijoin.test 2010-02-21 07:33:54 +0000
+++ b/mysql-test/t/subselect_no_semijoin.test 2010-07-16 08:58:24 +0000
@@ -1,13 +1,8 @@
#
# Run subselect.test without semi-join optimization (test materialize)
#
---replace_regex /,table_elimination=on//
-show variables like 'optimizer_switch';
set optimizer_switch='semijoin=off';
--source t/subselect.test
set optimizer_switch=default;
---replace_regex /,table_elimination=on//
-show variables like 'optimizer_switch';
-
=== modified file 'mysql-test/t/subselect_sj.test'
--- a/mysql-test/t/subselect_sj.test 2010-03-15 06:32:54 +0000
+++ b/mysql-test/t/subselect_sj.test 2010-07-16 08:58:24 +0000
@@ -92,46 +92,6 @@ execute s1;
insert into t1 select (A.a + 10 * B.a),1 from t0 A, t0 B;
explain extended select * from t1 where a in (select pk from t10 where pk<3);
---echo
---echo BUG#37120 optimizer_switch allowable values not according to specification
---echo
-
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,materialization=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,semijoin=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,loosescan=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,semijoin=off,materialization=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,materialization=off,semijoin=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,semijoin=off,materialization=off,loosescan=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,semijoin=off,loosescan=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-
-set optimizer_switch='default,materialization=off,loosescan=off';
---replace_regex /,table_elimination=on//
-select @@optimizer_switch;
-set optimizer_switch=default;
-
drop table t0, t1, t2;
drop table t10, t11, t12;
=== modified file 'mysys/queues.c'
--- a/mysys/queues.c 2008-02-18 22:29:39 +0000
+++ b/mysys/queues.c 2010-07-16 07:33:01 +0000
@@ -1,25 +1,42 @@
-/* Copyright (C) 2000, 2005 MySQL AB
+/* Copyright (C) 2010 Monty Program Ab
+ All Rights reserved
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; version 2 of the License.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the following disclaimer
+ in the documentation and/or other materials provided with the
+ distribution.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ <COPYRIGHT HOLDER> BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGE.
+*/
/*
+ This code originates from the Unireg project.
+
Code for generell handling of priority Queues.
Implemention of queues from "Algoritms in C" by Robert Sedgewick.
- An optimisation of _downheap suggested in Exercise 7.51 in "Data
- Structures & Algorithms in C++" by Mark Allen Weiss, Second Edition
- was implemented by Mikael Ronstrom 2005. Also the O(N) algorithm
- of queue_fix was implemented.
+
+  The queue can optionally store each element's current position in the
+  queue inside the element itself. This makes it possible to find an
+  element in O(1) time and delete it from the queue without a linear scan.
+
+ Optimisation of _downheap() and queue_fix() is inspired by code done
+ by Mikael Ronström, based on an optimisation of _downheap from
+ Exercise 7.51 in "Data Structures & Algorithms in C++" by Mark Allen
+ Weiss, Second Edition.
*/
#include "mysys_priv.h"
@@ -39,6 +56,10 @@
max_at_top Set to 1 if you want biggest element on top.
compare Compare function for elements, takes 3 arguments.
first_cmp_arg First argument to compare function
+ offset_to_queue_pos If <> 0, then offset+1 in element to store position
+ in queue (for fast delete of element in queue)
+    auto_extent		When the queue is full and there is an insert
+				operation, extend the queue by this many elements.
NOTES
Will allocate max_element pointers for queue array
@@ -50,74 +71,33 @@
int init_queue(QUEUE *queue, uint max_elements, uint offset_to_key,
pbool max_at_top, int (*compare) (void *, uchar *, uchar *),
- void *first_cmp_arg)
+ void *first_cmp_arg, uint offset_to_queue_pos,
+ uint auto_extent)
+
{
DBUG_ENTER("init_queue");
- if ((queue->root= (uchar **) my_malloc((max_elements+1)*sizeof(void*),
+ if ((queue->root= (uchar **) my_malloc((max_elements + 1) * sizeof(void*),
MYF(MY_WME))) == 0)
DBUG_RETURN(1);
- queue->elements=0;
- queue->compare=compare;
- queue->first_cmp_arg=first_cmp_arg;
- queue->max_elements=max_elements;
- queue->offset_to_key=offset_to_key;
+ queue->elements= 0;
+ queue->compare= compare;
+ queue->first_cmp_arg= first_cmp_arg;
+ queue->max_elements= max_elements;
+ queue->offset_to_key= offset_to_key;
+ queue->offset_to_queue_pos= offset_to_queue_pos;
+ queue->auto_extent= auto_extent;
queue_set_max_at_top(queue, max_at_top);
DBUG_RETURN(0);
}
-
-/*
- Init queue, uses init_queue internally for init work but also accepts
- auto_extent as parameter
-
- SYNOPSIS
- init_queue_ex()
- queue Queue to initialise
- max_elements Max elements that will be put in queue
- offset_to_key Offset to key in element stored in queue
- Used when sending pointers to compare function
- max_at_top Set to 1 if you want biggest element on top.
- compare Compare function for elements, takes 3 arguments.
- first_cmp_arg First argument to compare function
- auto_extent When the queue is full and there is insert operation
- extend the queue.
-
- NOTES
- Will allocate max_element pointers for queue array
-
- RETURN
- 0 ok
- 1 Could not allocate memory
-*/
-
-int init_queue_ex(QUEUE *queue, uint max_elements, uint offset_to_key,
- pbool max_at_top, int (*compare) (void *, uchar *, uchar *),
- void *first_cmp_arg, uint auto_extent)
-{
- int ret;
- DBUG_ENTER("init_queue_ex");
-
- if ((ret= init_queue(queue, max_elements, offset_to_key, max_at_top, compare,
- first_cmp_arg)))
- DBUG_RETURN(ret);
-
- queue->auto_extent= auto_extent;
- DBUG_RETURN(0);
-}
-
/*
Reinitialize queue for other usage
SYNOPSIS
reinit_queue()
queue Queue to initialise
- max_elements Max elements that will be put in queue
- offset_to_key Offset to key in element stored in queue
- Used when sending pointers to compare function
- max_at_top Set to 1 if you want biggest element on top.
- compare Compare function for elements, takes 3 arguments.
- first_cmp_arg First argument to compare function
+    For the rest of the arguments, see init_queue() above
NOTES
This will delete all elements from the queue. If you don't want this,
@@ -125,21 +105,23 @@ int init_queue_ex(QUEUE *queue, uint max
RETURN
0 ok
- EE_OUTOFMEMORY Wrong max_elements
+ 1 Wrong max_elements; Queue has old size
*/
int reinit_queue(QUEUE *queue, uint max_elements, uint offset_to_key,
pbool max_at_top, int (*compare) (void *, uchar *, uchar *),
- void *first_cmp_arg)
+ void *first_cmp_arg, uint offset_to_queue_pos,
+ uint auto_extent)
{
DBUG_ENTER("reinit_queue");
- queue->elements=0;
- queue->compare=compare;
- queue->first_cmp_arg=first_cmp_arg;
- queue->offset_to_key=offset_to_key;
+ queue->elements= 0;
+ queue->compare= compare;
+ queue->first_cmp_arg= first_cmp_arg;
+ queue->offset_to_key= offset_to_key;
+ queue->offset_to_queue_pos= offset_to_queue_pos;
+ queue->auto_extent= auto_extent;
queue_set_max_at_top(queue, max_at_top);
- resize_queue(queue, max_elements);
- DBUG_RETURN(0);
+ DBUG_RETURN(resize_queue(queue, max_elements));
}
@@ -167,8 +149,8 @@ int resize_queue(QUEUE *queue, uint max_
if (queue->max_elements == max_elements)
DBUG_RETURN(0);
if ((new_root= (uchar **) my_realloc((void *)queue->root,
- (max_elements+1)*sizeof(void*),
- MYF(MY_WME))) == 0)
+ (max_elements + 1)* sizeof(void*),
+ MYF(MY_WME))) == 0)
DBUG_RETURN(1);
set_if_smaller(queue->elements, max_elements);
queue->max_elements= max_elements;
@@ -197,39 +179,58 @@ void delete_queue(QUEUE *queue)
if (queue->root)
{
my_free((uchar*) queue->root,MYF(0));
- queue->root=0;
+ queue->root=0; /* Allow multiple calls */
}
DBUG_VOID_RETURN;
}
- /* Code for insert, search and delete of elements */
+/*
+ Insert element in queue
+
+ SYNOPSIS
+ queue_insert()
+ queue Queue to use
+ element Element to insert
+*/
void queue_insert(register QUEUE *queue, uchar *element)
{
reg2 uint idx, next;
+ uint offset_to_queue_pos= queue->offset_to_queue_pos;
DBUG_ASSERT(queue->elements < queue->max_elements);
- queue->root[0]= element;
+
idx= ++queue->elements;
/* max_at_top swaps the comparison if we want to order by desc */
- while ((queue->compare(queue->first_cmp_arg,
+ while (idx > 1 &&
+ (queue->compare(queue->first_cmp_arg,
element + queue->offset_to_key,
queue->root[(next= idx >> 1)] +
queue->offset_to_key) * queue->max_at_top) < 0)
{
queue->root[idx]= queue->root[next];
+ if (offset_to_queue_pos)
+ (*(uint*) (queue->root[idx] + offset_to_queue_pos-1))= idx;
idx= next;
}
queue->root[idx]= element;
+ if (offset_to_queue_pos)
+ (*(uint*) (element+ offset_to_queue_pos-1))= idx;
}
+
/*
- Does safe insert. If no more space left on the queue resize it.
- Return codes:
- 0 - OK
- 1 - Cannot allocate more memory
- 2 - auto_extend is 0, the operation would
-
+ Like queue_insert, but resize queue if queue is full
+
+ SYNOPSIS
+ queue_insert_safe()
+ queue Queue to use
+ element Element to insert
+
+ RETURN
+ 0 OK
+ 1 Cannot allocate more memory
+    2	auto_extent is 0; No insertion done
*/
int queue_insert_safe(register QUEUE *queue, uchar *element)
@@ -239,7 +240,7 @@ int queue_insert_safe(register QUEUE *qu
{
if (!queue->auto_extent)
return 2;
- else if (resize_queue(queue, queue->max_elements + queue->auto_extent))
+ if (resize_queue(queue, queue->max_elements + queue->auto_extent))
return 1;
}
@@ -248,40 +249,48 @@ int queue_insert_safe(register QUEUE *qu
}
- /* Remove item from queue */
- /* Returns pointer to removed element */
+/*
+ Remove item from queue
+
+ SYNOPSIS
+ queue_remove()
+ queue Queue to use
+ element Index of element to remove.
+ First element in queue is 'queue_first_element(queue)'
+
+ RETURN
+ pointer to removed element
+*/
uchar *queue_remove(register QUEUE *queue, uint idx)
{
uchar *element;
- DBUG_ASSERT(idx < queue->max_elements);
- element= queue->root[++idx]; /* Intern index starts from 1 */
- queue->root[idx]= queue->root[queue->elements--];
- _downheap(queue, idx);
+ DBUG_ASSERT(idx >= 1 && idx <= queue->elements);
+ element= queue->root[idx];
+ _downheap(queue, idx, queue->root[queue->elements--]);
return element;
}
- /* Fix when element on top has been replaced */
-#ifndef queue_replaced
-void queue_replaced(QUEUE *queue)
-{
- _downheap(queue,1);
-}
-#endif
+/*
+ Add element to fixed position and update heap
-#ifndef OLD_VERSION
+ SYNOPSIS
+ _downheap()
+ queue Queue to use
+ idx Index of element to change
+ element Element to store at 'idx'
+*/
-void _downheap(register QUEUE *queue, uint idx)
+void _downheap(register QUEUE *queue, uint start_idx, uchar *element)
{
- uchar *element;
- uint elements,half_queue,offset_to_key, next_index;
+ uint elements,half_queue,offset_to_key, next_index, offset_to_queue_pos;
+ register uint idx= start_idx;
my_bool first= TRUE;
- uint start_idx= idx;
offset_to_key=queue->offset_to_key;
- element=queue->root[idx];
- half_queue=(elements=queue->elements) >> 1;
+ offset_to_queue_pos= queue->offset_to_queue_pos;
+ half_queue= (elements= queue->elements) >> 1;
while (idx <= half_queue)
{
@@ -298,393 +307,49 @@ void _downheap(register QUEUE *queue, ui
element+offset_to_key) * queue->max_at_top) >= 0)))
{
queue->root[idx]= element;
+ if (offset_to_queue_pos)
+ (*(uint*) (element + offset_to_queue_pos-1))= idx;
return;
}
- queue->root[idx]=queue->root[next_index];
- idx=next_index;
first= FALSE;
- }
-
- next_index= idx >> 1;
- while (next_index > start_idx)
- {
- if ((queue->compare(queue->first_cmp_arg,
- queue->root[next_index]+offset_to_key,
- element+offset_to_key) *
- queue->max_at_top) < 0)
- break;
- queue->root[idx]=queue->root[next_index];
+ queue->root[idx]= queue->root[next_index];
+ if (offset_to_queue_pos)
+ (*(uint*) (queue->root[idx] + offset_to_queue_pos-1))= idx;
idx=next_index;
- next_index= idx >> 1;
}
- queue->root[idx]=element;
-}
-#else
/*
- The old _downheap version is kept for comparisons with the benchmark
- suit or new benchmarks anyone wants to run for comparisons.
+ Insert the element into the right position. This is the same code
+ as we have in queue_insert()
*/
- /* Fix heap when index have changed */
-void _downheap(register QUEUE *queue, uint idx)
-{
- uchar *element;
- uint elements,half_queue,next_index,offset_to_key;
-
- offset_to_key=queue->offset_to_key;
- element=queue->root[idx];
- half_queue=(elements=queue->elements) >> 1;
-
- while (idx <= half_queue)
- {
- next_index=idx+idx;
- if (next_index < elements &&
- (queue->compare(queue->first_cmp_arg,
- queue->root[next_index]+offset_to_key,
- queue->root[next_index+1]+offset_to_key) *
- queue->max_at_top) > 0)
- next_index++;
- if ((queue->compare(queue->first_cmp_arg,
- queue->root[next_index]+offset_to_key,
- element+offset_to_key) * queue->max_at_top) >= 0)
- break;
- queue->root[idx]=queue->root[next_index];
- idx=next_index;
+ while ((next_index= (idx >> 1)) > start_idx &&
+ queue->compare(queue->first_cmp_arg,
+ element+offset_to_key,
+ queue->root[next_index]+offset_to_key)*
+ queue->max_at_top < 0)
+ {
+ queue->root[idx]= queue->root[next_index];
+ if (offset_to_queue_pos)
+ (*(uint*) (queue->root[idx] + offset_to_queue_pos-1))= idx;
+ idx= next_index;
}
- queue->root[idx]=element;
+ queue->root[idx]= element;
+ if (offset_to_queue_pos)
+ (*(uint*) (element + offset_to_queue_pos-1))= idx;
}
-#endif
-
/*
Fix heap when every element was changed.
+
+ SYNOPSIS
+ queue_fix()
+ queue Queue to use
*/
void queue_fix(QUEUE *queue)
{
uint i;
for (i= queue->elements >> 1; i > 0; i--)
- _downheap(queue, i);
-}
-
-#ifdef MAIN
- /*
- A test program for the priority queue implementation.
- It can also be used to benchmark changes of the implementation
- Build by doing the following in the directory mysys
- make test_priority_queue
- ./test_priority_queue
-
- Written by Mikael Ronström, 2005
- */
-
-static uint num_array[1025];
-static uint tot_no_parts= 0;
-static uint tot_no_loops= 0;
-static uint expected_part= 0;
-static uint expected_num= 0;
-static bool max_ind= 0;
-static bool fix_used= 0;
-static ulonglong start_time= 0;
-
-static bool is_divisible_by(uint num, uint divisor)
-{
- uint quotient= num / divisor;
- if (quotient * divisor == num)
- return TRUE;
- return FALSE;
-}
-
-void calculate_next()
-{
- uint part= expected_part, num= expected_num;
- uint no_parts= tot_no_parts;
- if (max_ind)
- {
- do
- {
- while (++part <= no_parts)
- {
- if (is_divisible_by(num, part) &&
- (num <= ((1 << 21) + part)))
- {
- expected_part= part;
- expected_num= num;
- return;
- }
- }
- part= 0;
- } while (--num);
- }
- else
- {
- do
- {
- while (--part > 0)
- {
- if (is_divisible_by(num, part))
- {
- expected_part= part;
- expected_num= num;
- return;
- }
- }
- part= no_parts + 1;
- } while (++num);
- }
-}
-
-void calculate_end_next(uint part)
-{
- uint no_parts= tot_no_parts, num;
- num_array[part]= 0;
- if (max_ind)
- {
- expected_num= 0;
- for (part= no_parts; part > 0 ; part--)
- {
- if (num_array[part])
- {
- num= num_array[part] & 0x3FFFFF;
- if (num >= expected_num)
- {
- expected_num= num;
- expected_part= part;
- }
- }
- }
- if (expected_num == 0)
- expected_part= 0;
- }
- else
- {
- expected_num= 0xFFFFFFFF;
- for (part= 1; part <= no_parts; part++)
- {
- if (num_array[part])
- {
- num= num_array[part] & 0x3FFFFF;
- if (num <= expected_num)
- {
- expected_num= num;
- expected_part= part;
- }
- }
- }
- if (expected_num == 0xFFFFFFFF)
- expected_part= 0;
- }
- return;
-}
-static int test_compare(void *null_arg, uchar *a, uchar *b)
-{
- uint a_num= (*(uint*)a) & 0x3FFFFF;
- uint b_num= (*(uint*)b) & 0x3FFFFF;
- uint a_part, b_part;
- if (a_num > b_num)
- return +1;
- if (a_num < b_num)
- return -1;
- a_part= (*(uint*)a) >> 22;
- b_part= (*(uint*)b) >> 22;
- if (a_part < b_part)
- return +1;
- if (a_part > b_part)
- return -1;
- return 0;
-}
-
-bool check_num(uint num_part)
-{
- uint part= num_part >> 22;
- uint num= num_part & 0x3FFFFF;
- if (part == expected_part)
- if (num == expected_num)
- return FALSE;
- printf("Expect part %u Expect num 0x%x got part %u num 0x%x max_ind %u fix_used %u \n",
- expected_part, expected_num, part, num, max_ind, fix_used);
- return TRUE;
-}
-
-
-void perform_insert(QUEUE *queue)
-{
- uint i= 1, no_parts= tot_no_parts;
- uint backward_start= 0;
-
- expected_part= 1;
- expected_num= 1;
-
- if (max_ind)
- backward_start= 1 << 21;
-
- do
- {
- uint num= (i + backward_start);
- if (max_ind)
- {
- while (!is_divisible_by(num, i))
- num--;
- if (max_ind && (num > expected_num ||
- (num == expected_num && i < expected_part)))
- {
- expected_num= num;
- expected_part= i;
- }
- }
- num_array[i]= num + (i << 22);
- if (fix_used)
- queue_element(queue, i-1)= (uchar*)&num_array[i];
- else
- queue_insert(queue, (uchar*)&num_array[i]);
- } while (++i <= no_parts);
- if (fix_used)
- {
- queue->elements= no_parts;
- queue_fix(queue);
- }
-}
-
-bool perform_ins_del(QUEUE *queue, bool max_ind)
-{
- uint i= 0, no_loops= tot_no_loops, j= tot_no_parts;
- do
- {
- uint num_part= *(uint*)queue_top(queue);
- uint part= num_part >> 22;
- if (check_num(num_part))
- return TRUE;
- if (j++ >= no_loops)
- {
- calculate_end_next(part);
- queue_remove(queue, (uint) 0);
- }
- else
- {
- calculate_next();
- if (max_ind)
- num_array[part]-= part;
- else
- num_array[part]+= part;
- queue_top(queue)= (uchar*)&num_array[part];
- queue_replaced(queue);
- }
- } while (++i < no_loops);
- return FALSE;
-}
-
-bool do_test(uint no_parts, uint l_max_ind, bool l_fix_used)
-{
- QUEUE queue;
- bool result;
- max_ind= l_max_ind;
- fix_used= l_fix_used;
- init_queue(&queue, no_parts, 0, max_ind, test_compare, NULL);
- tot_no_parts= no_parts;
- tot_no_loops= 1024;
- perform_insert(&queue);
- if ((result= perform_ins_del(&queue, max_ind)))
- delete_queue(&queue);
- if (result)
- {
- printf("Error\n");
- return TRUE;
- }
- return FALSE;
-}
-
-static void start_measurement()
-{
- start_time= my_getsystime();
-}
-
-static void stop_measurement()
-{
- ulonglong stop_time= my_getsystime();
- uint time_in_micros;
- stop_time-= start_time;
- stop_time/= 10; /* Convert to microseconds */
- time_in_micros= (uint)stop_time;
- printf("Time expired is %u microseconds \n", time_in_micros);
-}
-
-static void benchmark_test()
-{
- QUEUE queue_real;
- QUEUE *queue= &queue_real;
- uint i, add;
- fix_used= TRUE;
- max_ind= FALSE;
- tot_no_parts= 1024;
- init_queue(queue, tot_no_parts, 0, max_ind, test_compare, NULL);
- /*
- First benchmark whether queue_fix is faster than using queue_insert
- for sizes of 16 partitions.
- */
- for (tot_no_parts= 2, add=2; tot_no_parts < 128;
- tot_no_parts+= add, add++)
- {
- printf("Start benchmark queue_fix, tot_no_parts= %u \n", tot_no_parts);
- start_measurement();
- for (i= 0; i < 128; i++)
- {
- perform_insert(queue);
- queue_remove_all(queue);
- }
- stop_measurement();
-
- fix_used= FALSE;
- printf("Start benchmark queue_insert\n");
- start_measurement();
- for (i= 0; i < 128; i++)
- {
- perform_insert(queue);
- queue_remove_all(queue);
- }
- stop_measurement();
- }
- /*
- Now benchmark insertion and deletion of 16400 elements.
- Used in consecutive runs this shows whether the optimised _downheap
- is faster than the standard implementation.
- */
- printf("Start benchmarking _downheap \n");
- start_measurement();
- perform_insert(queue);
- for (i= 0; i < 65536; i++)
- {
- uint num, part;
- num= *(uint*)queue_top(queue);
- num+= 16;
- part= num >> 22;
- num_array[part]= num;
- queue_top(queue)= (uchar*)&num_array[part];
- queue_replaced(queue);
- }
- for (i= 0; i < 16; i++)
- queue_remove(queue, (uint) 0);
- queue_remove_all(queue);
- stop_measurement();
-}
-
-int main()
-{
- int i, add= 1;
- for (i= 1; i < 1024; i+=add, add++)
- {
- printf("Start test for priority queue of size %u\n", i);
- if (do_test(i, 0, 1))
- return -1;
- if (do_test(i, 1, 1))
- return -1;
- if (do_test(i, 0, 0))
- return -1;
- if (do_test(i, 1, 0))
- return -1;
- }
- benchmark_test();
- printf("OK\n");
- return 0;
+ _downheap(queue, i, queue_element(queue, i));
}
-#endif
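
The rewritten queues.c above folds init_queue_ex() into init_queue() and adds
optional position tracking through offset_to_queue_pos, which thr_alarm.c below
uses to delete an alarm by its stored index instead of scanning the whole queue.
A minimal caller-side sketch of the new API follows; it is not part of the patch,
it assumes compilation inside this tree against my_global.h, my_sys.h and
queues.h, and the MY_EVENT struct, cmp_expire_time() and example() names are
invented for illustration.

/* Sketch only: exercise the merged init_queue() signature with position
   tracking (offset_to_queue_pos) and automatic growth (auto_extent). */
#include <my_global.h>
#include <my_sys.h>
#include <queues.h>

typedef struct st_my_event
{
  ulong expire_time;                      /* key the heap is ordered on */
  uint  index_in_queue;                   /* maintained by the queue code */
} MY_EVENT;

static int cmp_expire_time(void *cmp_arg __attribute__((unused)),
                           uchar *a, uchar *b)
{
  ulong ta= *(ulong*) a, tb= *(ulong*) b;
  return (ta < tb) ? -1 : (ta > tb ? 1 : 0);
}

static void example(void)
{
  QUEUE queue;
  MY_EVENT ev1, ev2, *top;

  ev1.expire_time= 10; ev1.index_in_queue= 0;
  ev2.expire_time= 20; ev2.index_in_queue= 0;

  /* offset_to_queue_pos is the offset of the position field + 1; passing 0
     disables position tracking. auto_extent= 16 lets queue_insert_safe()
     grow a full queue by 16 elements. */
  if (init_queue(&queue, 2, offsetof(MY_EVENT, expire_time),
                 0,                              /* smallest element on top */
                 cmp_expire_time, NullS,
                 offsetof(MY_EVENT, index_in_queue) + 1, 16))
    return;                                      /* out of memory */

  queue_insert(&queue, (uchar*) &ev2);
  queue_insert(&queue, (uchar*) &ev1);

  /* Both elements now carry their 1-based heap position, so either one can
     be removed directly, without searching the queue for it. */
  queue_remove(&queue, ev2.index_in_queue);

  top= (MY_EVENT*) queue_top(&queue);            /* ev1: smallest expire_time */
  (void) top;

  delete_queue(&queue);
}

thr_alarm.c below does exactly this with ALARM::index_in_queue, which is what
lets thr_end_alarm() remove its alarm in O(log N) instead of looping over the
whole alarm queue.
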
=== modified file 'mysys/thr_alarm.c'
--- a/mysys/thr_alarm.c 2008-10-10 15:28:41 +0000
+++ b/mysys/thr_alarm.c 2010-07-16 07:33:01 +0000
@@ -41,6 +41,19 @@ volatile my_bool alarm_thread_running= 0
time_t next_alarm_expire_time= ~ (time_t) 0;
static sig_handler process_alarm_part2(int sig);
+#ifdef DBUG_OFF
+#define reset_index_in_queue(alarm_data)
+#else
+#define reset_index_in_queue(alarm_data) alarm_data->index_in_queue= 0;
+#endif /* DBUG_OFF */
+
+#ifndef USE_ONE_SIGNAL_HAND
+#define one_signal_hand_sigmask(A,B,C) pthread_sigmask((A), (B), (C))
+#else
+#define one_signal_hand_sigmask(A,B,C)
+#endif
+
+
#if !defined(__WIN__)
static pthread_mutex_t LOCK_alarm;
@@ -72,8 +85,8 @@ void init_thr_alarm(uint max_alarms)
DBUG_ENTER("init_thr_alarm");
alarm_aborted=0;
next_alarm_expire_time= ~ (time_t) 0;
- init_queue(&alarm_queue,max_alarms+1,offsetof(ALARM,expire_time),0,
- compare_ulong,NullS);
+ init_queue(&alarm_queue, max_alarms+1, offsetof(ALARM,expire_time), 0,
+ compare_ulong, NullS, offsetof(ALARM, index_in_queue)+1, 0);
sigfillset(&full_signal_set); /* Neaded to block signals */
pthread_mutex_init(&LOCK_alarm,MY_MUTEX_INIT_FAST);
pthread_cond_init(&COND_alarm,NULL);
@@ -151,7 +164,7 @@ void resize_thr_alarm(uint max_alarms)
my_bool thr_alarm(thr_alarm_t *alrm, uint sec, ALARM *alarm_data)
{
- time_t now;
+ time_t now, next;
#ifndef USE_ONE_SIGNAL_HAND
sigset_t old_mask;
#endif
@@ -161,79 +174,68 @@ my_bool thr_alarm(thr_alarm_t *alrm, uin
DBUG_PRINT("enter",("thread: %s sec: %d",my_thread_name(),sec));
now= my_time(0);
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_BLOCK,&full_signal_set,&old_mask);
-#endif
+ if (!alarm_data)
+ {
+ if (!(alarm_data=(ALARM*) my_malloc(sizeof(ALARM),MYF(MY_WME))))
+ goto abort_no_unlock;
+ alarm_data->malloced= 1;
+ }
+ else
+ alarm_data->malloced= 0;
+ next= now + sec;
+ alarm_data->expire_time= next;
+ alarm_data->alarmed= 0;
+ alarm_data->thread= current_my_thread_var->pthread_self;
+ alarm_data->thread_id= current_my_thread_var->id;
+
+ one_signal_hand_sigmask(SIG_BLOCK,&full_signal_set,&old_mask);
pthread_mutex_lock(&LOCK_alarm); /* Lock from threads & alarms */
- if (alarm_aborted > 0)
+ if (unlikely(alarm_aborted))
{ /* No signal thread */
DBUG_PRINT("info", ("alarm aborted"));
- *alrm= 0; /* No alarm */
- pthread_mutex_unlock(&LOCK_alarm);
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_SETMASK,&old_mask,NULL);
-#endif
- DBUG_RETURN(1);
- }
- if (alarm_aborted < 0)
+ if (alarm_aborted > 0)
+ goto abort;
sec= 1; /* Abort mode */
-
+ }
if (alarm_queue.elements >= max_used_alarms)
{
if (alarm_queue.elements == alarm_queue.max_elements)
{
DBUG_PRINT("info", ("alarm queue full"));
fprintf(stderr,"Warning: thr_alarm queue is full\n");
- *alrm= 0; /* No alarm */
- pthread_mutex_unlock(&LOCK_alarm);
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_SETMASK,&old_mask,NULL);
-#endif
- DBUG_RETURN(1);
+ goto abort;
}
max_used_alarms=alarm_queue.elements+1;
}
- reschedule= (ulong) next_alarm_expire_time > (ulong) now + sec;
- if (!alarm_data)
- {
- if (!(alarm_data=(ALARM*) my_malloc(sizeof(ALARM),MYF(MY_WME))))
- {
- DBUG_PRINT("info", ("failed my_malloc()"));
- *alrm= 0; /* No alarm */
- pthread_mutex_unlock(&LOCK_alarm);
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_SETMASK,&old_mask,NULL);
-#endif
- DBUG_RETURN(1);
- }
- alarm_data->malloced=1;
- }
- else
- alarm_data->malloced=0;
- alarm_data->expire_time=now+sec;
- alarm_data->alarmed=0;
- alarm_data->thread= current_my_thread_var->pthread_self;
- alarm_data->thread_id= current_my_thread_var->id;
+ reschedule= (ulong) next_alarm_expire_time > (ulong) next;
queue_insert(&alarm_queue,(uchar*) alarm_data);
+ assert(alarm_data->index_in_queue > 0);
/* Reschedule alarm if the current one has more than sec left */
- if (reschedule)
+ if (unlikely(reschedule))
{
DBUG_PRINT("info", ("reschedule"));
if (pthread_equal(pthread_self(),alarm_thread))
{
alarm(sec); /* purecov: inspected */
- next_alarm_expire_time= now + sec;
+ next_alarm_expire_time= next;
}
else
reschedule_alarms(); /* Reschedule alarms */
}
pthread_mutex_unlock(&LOCK_alarm);
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_SETMASK,&old_mask,NULL);
-#endif
+ one_signal_hand_sigmask(SIG_SETMASK,&old_mask,NULL);
(*alrm)= &alarm_data->alarmed;
DBUG_RETURN(0);
+
+abort:
+ if (alarm_data->malloced)
+ my_free(alarm_data, MYF(0));
+ pthread_mutex_unlock(&LOCK_alarm);
+ one_signal_hand_sigmask(SIG_SETMASK,&old_mask,NULL);
+abort_no_unlock:
+ *alrm= 0; /* No alarm */
+ DBUG_RETURN(1);
}
@@ -247,41 +249,18 @@ void thr_end_alarm(thr_alarm_t *alarmed)
#ifndef USE_ONE_SIGNAL_HAND
sigset_t old_mask;
#endif
- uint i, found=0;
DBUG_ENTER("thr_end_alarm");
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_BLOCK,&full_signal_set,&old_mask);
-#endif
- pthread_mutex_lock(&LOCK_alarm);
-
+ one_signal_hand_sigmask(SIG_BLOCK,&full_signal_set,&old_mask);
alarm_data= (ALARM*) ((uchar*) *alarmed - offsetof(ALARM,alarmed));
- for (i=0 ; i < alarm_queue.elements ; i++)
- {
- if ((ALARM*) queue_element(&alarm_queue,i) == alarm_data)
- {
- queue_remove(&alarm_queue,i),MYF(0);
- if (alarm_data->malloced)
- my_free((uchar*) alarm_data,MYF(0));
- found++;
-#ifdef DBUG_OFF
- break;
-#endif
- }
- }
- DBUG_ASSERT(!*alarmed || found == 1);
- if (!found)
- {
- if (*alarmed)
- fprintf(stderr,"Warning: Didn't find alarm 0x%lx in queue of %d alarms\n",
- (long) *alarmed, alarm_queue.elements);
- DBUG_PRINT("warning",("Didn't find alarm 0x%lx in queue\n",
- (long) *alarmed));
- }
+ pthread_mutex_lock(&LOCK_alarm);
+ DBUG_ASSERT(alarm_data->index_in_queue != 0);
+ DBUG_ASSERT(queue_element(&alarm_queue, alarm_data->index_in_queue) ==
+ alarm_data);
+ queue_remove(&alarm_queue, alarm_data->index_in_queue);
pthread_mutex_unlock(&LOCK_alarm);
-#ifndef USE_ONE_SIGNAL_HAND
- pthread_sigmask(SIG_SETMASK,&old_mask,NULL);
-#endif
+ one_signal_hand_sigmask(SIG_SETMASK,&old_mask,NULL);
+ reset_index_in_queue(alarm_data);
DBUG_VOID_RETURN;
}
@@ -344,12 +323,13 @@ static sig_handler process_alarm_part2(i
#if defined(MAIN) && !defined(__bsdi__)
printf("process_alarm\n"); fflush(stdout);
#endif
- if (alarm_queue.elements)
+ if (likely(alarm_queue.elements))
{
- if (alarm_aborted)
+ if (unlikely(alarm_aborted))
{
uint i;
- for (i=0 ; i < alarm_queue.elements ;)
+ for (i= queue_first_element(&alarm_queue) ;
+ i <= queue_last_element(&alarm_queue) ;)
{
alarm_data=(ALARM*) queue_element(&alarm_queue,i);
alarm_data->alarmed=1; /* Info to thread */
@@ -360,6 +340,7 @@ static sig_handler process_alarm_part2(i
printf("Warning: pthread_kill couldn't find thread!!!\n");
#endif
queue_remove(&alarm_queue,i); /* No thread. Remove alarm */
+ reset_index_in_queue(alarm_data);
}
else
i++; /* Signal next thread */
@@ -371,8 +352,8 @@ static sig_handler process_alarm_part2(i
}
else
{
- ulong now=(ulong) my_time(0);
- ulong next=now+10-(now%10);
+ time_t now= my_time(0);
+ time_t next= now+10-(now%10);
while ((alarm_data=(ALARM*) queue_top(&alarm_queue))->expire_time <= now)
{
alarm_data->alarmed=1; /* Info to thread */
@@ -382,15 +363,16 @@ static sig_handler process_alarm_part2(i
{
#ifdef MAIN
printf("Warning: pthread_kill couldn't find thread!!!\n");
-#endif
- queue_remove(&alarm_queue,0); /* No thread. Remove alarm */
+#endif /* MAIN */
+ queue_remove_top(&alarm_queue); /* No thread. Remove alarm */
+ reset_index_in_queue(alarm_data);
if (!alarm_queue.elements)
break;
}
else
{
alarm_data->expire_time=next;
- queue_replaced(&alarm_queue);
+ queue_replace_top(&alarm_queue);
}
}
#ifndef USE_ALARM_THREAD
@@ -486,13 +468,15 @@ void thr_alarm_kill(my_thread_id thread_
if (alarm_aborted)
return;
pthread_mutex_lock(&LOCK_alarm);
- for (i=0 ; i < alarm_queue.elements ; i++)
+ for (i= queue_first_element(&alarm_queue) ;
+ i <= queue_last_element(&alarm_queue);
+ i++)
{
- if (((ALARM*) queue_element(&alarm_queue,i))->thread_id == thread_id)
+ ALARM *element= (ALARM*) queue_element(&alarm_queue,i);
+ if (element->thread_id == thread_id)
{
- ALARM *tmp=(ALARM*) queue_remove(&alarm_queue,i);
- tmp->expire_time=0;
- queue_insert(&alarm_queue,(uchar*) tmp);
+ element->expire_time= 0;
+ queue_replace(&alarm_queue, i);
reschedule_alarms();
break;
}
@@ -508,7 +492,7 @@ void thr_alarm_info(ALARM_INFO *info)
info->max_used_alarms= max_used_alarms;
if ((info->active_alarms= alarm_queue.elements))
{
- ulong now=(ulong) my_time(0);
+ time_t now= my_time(0);
long time_diff;
ALARM *alarm_data= (ALARM*) queue_top(&alarm_queue);
time_diff= (long) (alarm_data->expire_time - now);
@@ -556,7 +540,7 @@ static void *alarm_handler(void *arg __a
{
if (alarm_queue.elements)
{
- ulong sleep_time,now= my_time(0);
+ time_t sleep_time,now= my_time(0);
if (alarm_aborted)
sleep_time=now+1;
else
@@ -792,20 +776,6 @@ static void *test_thread(void *arg)
return 0;
}
-#ifdef USE_ONE_SIGNAL_HAND
-static sig_handler print_signal_warning(int sig)
-{
- printf("Warning: Got signal %d from thread %s\n",sig,my_thread_name());
- fflush(stdout);
-#ifdef DONT_REMEMBER_SIGNAL
- my_sigset(sig,print_signal_warning); /* int. thread system calls */
-#endif
- if (sig == SIGALRM)
- alarm(2); /* reschedule alarm */
-}
-#endif /* USE_ONE_SIGNAL_HAND */
-
-
static void *signal_hand(void *arg __attribute__((unused)))
{
sigset_t set;
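Most of the changes above, and in the files that follow, are mechanical
adaptations to an extended QUEUE interface. The change to include/queues.h
itself is not part of this mail, so the following is only a sketch of the API
as it can be inferred from the call sites: the two extra init_queue()
arguments, the apparently 1-based iteration helpers, and the new top-of-heap
operations. The EVENT struct and all function names below are invented for
illustration.

#include <my_global.h>
#include <my_sys.h>
#include <queues.h>

typedef struct st_event
{
  ulong expire_time;                    /* the sort key                       */
  uint  index_in_queue;                 /* maintained by the queue, if asked  */
} EVENT;

static int cmp_events(void *cmp_arg __attribute__((unused)),
                      uchar *a, uchar *b)
{
  ulong ka= ((EVENT*) a)->expire_time;
  ulong kb= ((EVENT*) b)->expire_time;
  return (ka < kb) ? -1 : (ka > kb) ? 1 : 0;
}

static void queue_api_sketch(void)
{
  QUEUE q;
  EVENT ev;
  uint i;

  /*
    The last two arguments are the new ones (inferred from the call sites):
    - offsetof(EVENT, index_in_queue) + 1: the queue maintains each element's
      heap position in that field; the +1 lets 0 mean "feature disabled";
    - 0: auto-extent, i.e. how many elements to grow by when full (0 = fixed).
  */
  init_queue(&q, 16, offsetof(EVENT, expire_time), 0,
             cmp_events, NullS, offsetof(EVENT, index_in_queue) + 1, 0);

  ev.expire_time= 10;
  ev.index_in_queue= 0;
  queue_insert(&q, (uchar*) &ev);

  /* Indexing is 1-based (root[0] stays reserved), hence the two helpers. */
  for (i= queue_first_element(&q); i <= queue_last_element(&q); i++)
  {
    EVENT *e= (EVENT*) queue_element(&q, i);
    (void) e;                           /* inspect element i here             */
  }

  /* Top-of-heap idioms replacing queue_remove(&q, 0) and queue_replaced(). */
  if (q.elements)
  {
    EVENT *top= (EVENT*) queue_top(&q);
    top->expire_time+= 10;
    queue_replace_top(&q);              /* key of the top changed: re-order   */
    queue_remove_top(&q);               /* pop the smallest element           */
  }
  delete_queue(&q);
}

Keeping each element's heap position in index_in_queue is what lets the
rewritten thr_end_alarm() above remove its ALARM directly with queue_remove()
instead of scanning the whole queue.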
=== modified file 'sql/create_options.cc'
--- a/sql/create_options.cc 2010-05-12 17:56:05 +0000
+++ b/sql/create_options.cc 2010-07-16 07:33:01 +0000
@@ -583,9 +583,9 @@ my_bool engine_table_options_frm_read(co
}
if (buff < buff_end)
- sql_print_warning("Table %`s was created in a later MariaDB version - "
+ sql_print_warning("Table '%s' was created in a later MariaDB version - "
"unknown table attributes were ignored",
- share->table_name);
+ share->table_name.str);
DBUG_RETURN(buff > buff_end);
}
=== modified file 'sql/event_queue.cc'
--- a/sql/event_queue.cc 2008-12-02 22:02:52 +0000
+++ b/sql/event_queue.cc 2010-07-16 07:33:01 +0000
@@ -136,9 +136,9 @@ Event_queue::init_queue(THD *thd)
LOCK_QUEUE_DATA();
- if (init_queue_ex(&queue, EVENT_QUEUE_INITIAL_SIZE , 0 /*offset*/,
- 0 /*max_on_top*/, event_queue_element_compare_q,
- NULL, EVENT_QUEUE_EXTENT))
+ if (::init_queue(&queue, EVENT_QUEUE_INITIAL_SIZE , 0 /*offset*/,
+ 0 /*max_on_top*/, event_queue_element_compare_q,
+ NullS, 0, EVENT_QUEUE_EXTENT))
{
sql_print_error("Event Scheduler: Can't initialize the execution queue");
goto err;
@@ -325,11 +325,13 @@ void
Event_queue::drop_matching_events(THD *thd, LEX_STRING pattern,
bool (*comparator)(LEX_STRING, Event_basic *))
{
- uint i= 0;
+ uint i;
DBUG_ENTER("Event_queue::drop_matching_events");
DBUG_PRINT("enter", ("pattern=%s", pattern.str));
- while (i < queue.elements)
+ for (i= queue_first_element(&queue) ;
+ i <= queue_last_element(&queue) ;
+ )
{
Event_queue_element *et= (Event_queue_element *) queue_element(&queue, i);
DBUG_PRINT("info", ("[%s.%s]?", et->dbname.str, et->name.str));
@@ -339,7 +341,8 @@ Event_queue::drop_matching_events(THD *t
The queue is ordered. If we remove an element, then all elements
after it will shift one position to the left, if we imagine it as
an array from left to the right. In this case we should not
- increment the counter and the (i < queue.elements) condition is ok.
+ increment the counter and the (i <= queue_last_element() condition
+ is ok.
*/
queue_remove(&queue, i);
delete et;
@@ -403,7 +406,9 @@ Event_queue::find_n_remove_event(LEX_STR
uint i;
DBUG_ENTER("Event_queue::find_n_remove_event");
- for (i= 0; i < queue.elements; ++i)
+ for (i= queue_first_element(&queue);
+ i <= queue_last_element(&queue);
+ i++)
{
Event_queue_element *et= (Event_queue_element *) queue_element(&queue, i);
DBUG_PRINT("info", ("[%s.%s]==[%s.%s]?", db.str, name.str,
@@ -441,7 +446,9 @@ Event_queue::recalculate_activation_time
LOCK_QUEUE_DATA();
DBUG_PRINT("info", ("%u loaded events to be recalculated", queue.elements));
- for (i= 0; i < queue.elements; i++)
+ for (i= queue_first_element(&queue);
+ i <= queue_last_element(&queue);
+ i++)
{
((Event_queue_element*)queue_element(&queue, i))->compute_next_execution_time();
((Event_queue_element*)queue_element(&queue, i))->update_timing_fields(thd);
@@ -454,16 +461,19 @@ Event_queue::recalculate_activation_time
have removed all. The queue has been ordered in a way the disabled
events are at the end.
*/
- for (i= queue.elements; i > 0; i--)
+ for (i= queue_last_element(&queue);
+ (int) i >= (int) queue_first_element(&queue);
+ i--)
{
- Event_queue_element *element = (Event_queue_element*)queue_element(&queue, i - 1);
+ Event_queue_element *element=
+ (Event_queue_element*)queue_element(&queue, i);
if (element->status != Event_parse_data::DISABLED)
break;
/*
This won't cause queue re-order, because we remove
always the last element.
*/
- queue_remove(&queue, i - 1);
+ queue_remove(&queue, i);
delete element;
}
UNLOCK_QUEUE_DATA();
@@ -499,7 +509,9 @@ Event_queue::empty_queue()
sql_print_information("Event Scheduler: Purging the queue. %u events",
queue.elements);
/* empty the queue */
- for (i= 0; i < queue.elements; ++i)
+ for (i= queue_first_element(&queue);
+ i <= queue_last_element(&queue);
+ i++)
{
Event_queue_element *et= (Event_queue_element *) queue_element(&queue, i);
delete et;
@@ -525,7 +537,9 @@ Event_queue::dbug_dump_queue(time_t now)
uint i;
DBUG_ENTER("Event_queue::dbug_dump_queue");
DBUG_PRINT("info", ("Dumping queue . Elements=%u", queue.elements));
- for (i = 0; i < queue.elements; i++)
+ for (i= queue_first_element(&queue);
+ i <= queue_last_element(&queue);
+ i++)
{
et= ((Event_queue_element*)queue_element(&queue, i));
DBUG_PRINT("info", ("et: 0x%lx name: %s.%s", (long) et,
@@ -592,7 +606,7 @@ Event_queue::get_top_for_execution_if_ti
continue;
}
- top= ((Event_queue_element*) queue_element(&queue, 0));
+ top= (Event_queue_element*) queue_top(&queue);
thd->set_current_time(); /* Get current time */
@@ -634,10 +648,10 @@ Event_queue::get_top_for_execution_if_ti
top->dbname.str, top->name.str,
top->dropped? "Dropping.":"");
delete top;
- queue_remove(&queue, 0);
+ queue_remove_top(&queue);
}
else
- queue_replaced(&queue);
+ queue_replace_top(&queue);
dbug_dump_queue(thd->query_start());
break;
=== modified file 'sql/filesort.cc'
--- a/sql/filesort.cc 2010-06-01 19:52:20 +0000
+++ b/sql/filesort.cc 2010-07-16 07:33:01 +0000
@@ -1151,7 +1151,9 @@ uint read_to_buffer(IO_CACHE *fromfile,
void reuse_freed_buff(QUEUE *queue, BUFFPEK *reuse, uint key_length)
{
uchar *reuse_end= reuse->base + reuse->max_keys * key_length;
- for (uint i= 0; i < queue->elements; ++i)
+ for (uint i= queue_first_element(queue);
+ i <= queue_last_element(queue);
+ i++)
{
BUFFPEK *bp= (BUFFPEK *) queue_element(queue, i);
if (bp->base + bp->max_keys * key_length == reuse->base)
@@ -1240,7 +1242,7 @@ int merge_buffers(SORTPARAM *param, IO_C
first_cmp_arg= (void*) &sort_length;
}
if (init_queue(&queue, (uint) (Tb-Fb)+1, offsetof(BUFFPEK,key), 0,
- (queue_compare) cmp, first_cmp_arg))
+ (queue_compare) cmp, first_cmp_arg, 0, 0))
DBUG_RETURN(1); /* purecov: inspected */
for (buffpek= Fb ; buffpek <= Tb ; buffpek++)
{
@@ -1277,7 +1279,7 @@ int merge_buffers(SORTPARAM *param, IO_C
error= 0; /* purecov: inspected */
goto end; /* purecov: inspected */
}
- queue_replaced(&queue); // Top element has been used
+ queue_replace_top(&queue); // Top element has been used
}
else
cmp= 0; // Not unique
@@ -1325,14 +1327,14 @@ int merge_buffers(SORTPARAM *param, IO_C
if (!(error= (int) read_to_buffer(from_file,buffpek,
rec_length)))
{
- VOID(queue_remove(&queue,0));
+ VOID(queue_remove_top(&queue));
reuse_freed_buff(&queue, buffpek, rec_length);
break; /* One buffer have been removed */
}
else if (error == -1)
goto err; /* purecov: inspected */
}
- queue_replaced(&queue); /* Top element has been replaced */
+ queue_replace_top(&queue); /* Top element has been replaced */
}
}
buffpek= (BUFFPEK*) queue_top(&queue);
=== modified file 'sql/ha_partition.cc'
--- a/sql/ha_partition.cc 2010-06-05 14:53:36 +0000
+++ b/sql/ha_partition.cc 2010-07-16 07:33:01 +0000
@@ -2567,7 +2567,7 @@ int ha_partition::open(const char *name,
Initialize priority queue, initialized to reading forward.
*/
if ((error= init_queue(&m_queue, m_tot_parts, (uint) PARTITION_BYTES_IN_POS,
- 0, key_rec_cmp, (void*)this)))
+ 0, key_rec_cmp, (void*)this, 0, 0)))
goto err_handler;
/*
@@ -4622,7 +4622,7 @@ int ha_partition::handle_unordered_scan_
int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
{
uint i;
- uint j= 0;
+ uint j= queue_first_element(&m_queue);
bool found= FALSE;
DBUG_ENTER("ha_partition::handle_ordered_index_scan");
@@ -4716,7 +4716,7 @@ int ha_partition::handle_ordered_index_s
*/
queue_set_max_at_top(&m_queue, reverse_order);
queue_set_cmp_arg(&m_queue, (void*)m_curr_key_info);
- m_queue.elements= j;
+ m_queue.elements= j - queue_first_element(&m_queue);
queue_fix(&m_queue);
return_top_record(buf);
table->status= 0;
@@ -4787,7 +4787,7 @@ int ha_partition::handle_ordered_next(uc
if (error == HA_ERR_END_OF_FILE)
{
/* Return next buffered row */
- queue_remove(&m_queue, (uint) 0);
+ queue_remove_top(&m_queue);
if (m_queue.elements)
{
DBUG_PRINT("info", ("Record returned from partition %u (2)",
@@ -4799,7 +4799,7 @@ int ha_partition::handle_ordered_next(uc
}
DBUG_RETURN(error);
}
- queue_replaced(&m_queue);
+ queue_replace_top(&m_queue);
return_top_record(buf);
DBUG_PRINT("info", ("Record returned from partition %u", m_top_entry));
DBUG_RETURN(0);
@@ -4830,7 +4830,7 @@ int ha_partition::handle_ordered_prev(uc
{
if (error == HA_ERR_END_OF_FILE)
{
- queue_remove(&m_queue, (uint) 0);
+ queue_remove_top(&m_queue);
if (m_queue.elements)
{
return_top_record(buf);
@@ -4842,7 +4842,7 @@ int ha_partition::handle_ordered_prev(uc
}
DBUG_RETURN(error);
}
- queue_replaced(&m_queue);
+ queue_replace_top(&m_queue);
return_top_record(buf);
DBUG_PRINT("info", ("Record returned from partition %d", m_top_entry));
DBUG_RETURN(0);
=== modified file 'sql/ha_partition.h'
--- a/sql/ha_partition.h 2010-06-05 14:53:36 +0000
+++ b/sql/ha_partition.h 2010-07-16 07:33:01 +0000
@@ -1129,7 +1129,7 @@ public:
virtual handlerton *partition_ht() const
{
handlerton *h= m_file[0]->ht;
- for (int i=1; i < m_tot_parts; i++)
+ for (uint i=1; i < m_tot_parts; i++)
DBUG_ASSERT(h == m_file[i]->ht);
return h;
}
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2010-07-10 10:37:30 +0000
+++ b/sql/item_cmpfunc.cc 2010-07-16 07:33:01 +0000
@@ -1778,10 +1778,12 @@ Item *Item_in_optimizer::expr_cache_inse
if (args[0]->cols() == 1)
depends_on.push_front((Item**)args);
else
- for (int i= 0; i < args[0]->cols(); i++)
+ {
+ for (uint i= 0; i < args[0]->cols(); i++)
{
depends_on.push_front(args[0]->addr(i));
}
+ }
if (args[1]->expr_cache_is_needed(thd))
DBUG_RETURN(set_expr_cache(thd, depends_on));
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-07-16 10:52:02 +0000
+++ b/sql/item_subselect.cc 2010-07-16 12:10:55 +0000
@@ -4880,7 +4880,8 @@ subselect_rowid_merge_engine::init(MY_BI
merge_keys[i]->sort_keys();
if (init_queue(&pq, keys_count, 0, FALSE,
- subselect_rowid_merge_engine::cmp_keys_by_cur_rownum, NULL))
+ subselect_rowid_merge_engine::cmp_keys_by_cur_rownum, NULL,
+ 0, 0))
return TRUE;
return FALSE;
=== modified file 'sql/mysqld.cc'
--- a/sql/mysqld.cc 2010-07-10 10:37:30 +0000
+++ b/sql/mysqld.cc 2010-07-16 08:58:24 +0000
@@ -409,6 +409,12 @@ static const char *optimizer_switch_str=
"index_merge_sort_union=on,"
"index_merge_intersection=on,"
"index_condition_pushdown=on,"
+ "firstmatch=on,"
+ "loosescan=on,"
+ "materialization=on,"
+ "semijoin=on,"
+ "partial_match_rowid_merge=on,"
+ "partial_match_table_scan=on,"
"subquery_cache=on"
#ifndef DBUG_OFF
",table_elimination=on";
@@ -7227,7 +7233,9 @@ The minimum value for this variable is 4
{"optimizer_switch", OPT_OPTIMIZER_SWITCH,
"optimizer_switch=option=val[,option=val...], where option={index_merge, "
"index_merge_union, index_merge_sort_union, index_merge_intersection, "
- "index_condition_pushdown, subquery_cache"
+ "index_condition_pushdown, firstmatch, loosescan, materialization, "
+ "semijoin, partial_match_rowid_merge, partial_match_table_scan, "
+ "subquery_cache"
#ifndef DBUG_OFF
", table_elimination"
#endif
=== modified file 'sql/net_serv.cc'
--- a/sql/net_serv.cc 2010-05-26 18:55:40 +0000
+++ b/sql/net_serv.cc 2010-07-16 07:33:01 +0000
@@ -262,18 +262,20 @@ static int net_data_is_ready(my_socket s
#endif /* EMBEDDED_LIBRARY */
/**
- Remove unwanted characters from connection
- and check if disconnected.
+ Intialize NET handler for new reads:
- Read from socket until there is nothing more to read. Discard
- what is read.
-
- If there is anything when to read 'net_clear' is called this
- normally indicates an error in the protocol.
-
- When connection is properly closed (for TCP it means with
- a FIN packet), then select() considers a socket "ready to read",
- in the sense that there's EOF to read, but read() returns 0.
+ - Read from socket until there is nothing more to read. Discard
+ what is read.
+ - Initialize net for new net_read/net_write calls.
+
+ If there is anything when to read 'net_clear' is called this
+ normally indicates an error in the protocol. Normally one should not
+ need to do clear the communication buffer. If one compiles without
+ -DUSE_NET_CLEAR then one wins one read call / query.
+
+ When connection is properly closed (for TCP it means with
+ a FIN packet), then select() considers a socket "ready to read",
+ in the sense that there's EOF to read, but read() returns 0.
@param net NET handler
@param clear_buffer if <> 0, then clear all data from comm buff
@@ -281,20 +283,18 @@ static int net_data_is_ready(my_socket s
void net_clear(NET *net, my_bool clear_buffer __attribute__((unused)))
{
-#if !defined(EMBEDDED_LIBRARY) && defined(DBUG_OFF)
- size_t count;
- int ready;
-#endif
DBUG_ENTER("net_clear");
/*
- We don't do a clear in case of DBUG_OFF to catch bugs
- in the protocol handling
+ We don't do a clear in case of not DBUG_OFF to catch bugs in the
+ protocol handling.
*/
-#if !defined(EMBEDDED_LIBRARY) && defined(DBUG_OFF)
+#if (!defined(EMBEDDED_LIBRARY) && defined(DBUG_OFF)) || defined(USE_NET_CLEAR)
if (clear_buffer)
{
+ size_t count;
+ int ready;
while ((ready= net_data_is_ready(net->vio->sd)) > 0)
{
/* The socket is ready */
=== modified file 'sql/opt_range.cc'
--- a/sql/opt_range.cc 2010-07-10 10:37:30 +0000
+++ b/sql/opt_range.cc 2010-07-16 07:33:01 +0000
@@ -1155,10 +1155,7 @@ QUICK_SELECT_I::QUICK_SELECT_I()
QUICK_RANGE_SELECT::QUICK_RANGE_SELECT(THD *thd, TABLE *table, uint key_nr,
bool no_alloc, MEM_ROOT *parent_alloc,
bool *create_error)
- :dont_free(0),doing_key_read(0),/*error(0),*/free_file(0),/*in_range(0),*/cur_range(NULL),last_range(0)
- //psergey3-merge: check whether we need doing_key_read and last_range
- // was:
- // :free_file(0),cur_range(NULL),last_range(0),dont_free(0)
+ :doing_key_read(0),/*error(0),*/free_file(0),/*in_range(0),*/cur_range(NULL),last_range(0),dont_free(0)
{
my_bitmap_map *bitmap;
DBUG_ENTER("QUICK_RANGE_SELECT::QUICK_RANGE_SELECT");
@@ -1594,7 +1591,7 @@ int QUICK_ROR_UNION_SELECT::init()
DBUG_ENTER("QUICK_ROR_UNION_SELECT::init");
if (init_queue(&queue, quick_selects.elements, 0,
FALSE , QUICK_ROR_UNION_SELECT::queue_cmp,
- (void*) this))
+ (void*) this, 0, 0))
{
bzero(&queue, sizeof(QUEUE));
DBUG_RETURN(1);
@@ -8293,12 +8290,12 @@ int QUICK_ROR_UNION_SELECT::get_next()
{
if (error != HA_ERR_END_OF_FILE)
DBUG_RETURN(error);
- queue_remove(&queue, 0);
+ queue_remove_top(&queue);
}
else
{
quick->save_last_pos();
- queue_replaced(&queue);
+ queue_replace_top(&queue);
}
if (!have_prev_rowid)
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-07-16 10:52:02 +0000
+++ b/sql/sql_class.cc 2010-07-16 12:10:55 +0000
@@ -2994,14 +2994,28 @@ create_result_table(THD *thd_arg, List<I
if (!stat)
return TRUE;
- cleanup();
-
+ reset();
table->file->extra(HA_EXTRA_WRITE_CACHE);
table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
return FALSE;
}
+void select_materialize_with_stats::reset()
+{
+ memset(col_stat, 0, table->s->fields * sizeof(Column_statistics));
+ max_nulls_in_row= 0;
+ count_rows= 0;
+}
+
+
+void select_materialize_with_stats::cleanup()
+{
+ reset();
+ select_union::cleanup();
+}
+
+
/**
Override select_union::send_data to analyze each row for NULLs and to
update null_statistics before sending data to the client.
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-07-16 10:52:02 +0000
+++ b/sql/sql_class.h 2010-07-16 12:10:55 +0000
@@ -2907,11 +2907,11 @@ public:
bool send_data(List<Item> &items);
bool send_eof();
bool flush();
- TMP_TABLE_PARAM *get_tmp_table_param() { return &tmp_table_param; }
-
+ void cleanup();
virtual bool create_result_table(THD *thd, List<Item> *column_types,
bool is_distinct, ulonglong options,
const char *alias, bool bit_fields_as_long);
+ TMP_TABLE_PARAM *get_tmp_table_param() { return &tmp_table_param; }
};
/* Base subselect interface class */
@@ -2971,6 +2971,9 @@ protected:
*/
ha_rows count_rows;
+protected:
+ void reset();
+
public:
select_materialize_with_stats() { tmp_table_param.init(); }
virtual bool create_result_table(THD *thd, List<Item> *column_types,
@@ -2978,12 +2981,7 @@ public:
const char *alias, bool bit_fields_as_long);
bool init_result_table(ulonglong select_options);
bool send_data(List<Item> &items);
- void cleanup()
- {
- memset(col_stat, 0, table->s->fields * sizeof(Column_statistics));
- max_nulls_in_row= 0;
- count_rows= 0;
- }
+ void cleanup();
ha_rows get_null_count_of_col(uint idx)
{
DBUG_ASSERT(idx < table->s->fields);
=== modified file 'sql/sql_union.cc'
--- a/sql/sql_union.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_union.cc 2010-07-16 11:02:15 +0000
@@ -136,6 +136,22 @@ select_union::create_result_table(THD *t
}
+/**
+ Reset and empty the temporary table that stores the materialized query result.
+
+ @note The cleanup performed here is exactly the same as for the two temp
+ tables of JOIN - exec_tmp_table_[1 | 2].
+*/
+
+void select_union::cleanup()
+{
+ table->file->extra(HA_EXTRA_RESET_STATE);
+ table->file->ha_delete_all_rows();
+ free_io_cache(table);
+ filesort_free_buffers(table,0);
+}
+
+
/*
initialization procedures before fake_select_lex preparation()
=== modified file 'sql/uniques.cc'
--- a/sql/uniques.cc 2009-09-07 20:50:10 +0000
+++ b/sql/uniques.cc 2010-07-16 07:33:01 +0000
@@ -423,7 +423,7 @@ static bool merge_walk(uchar *merge_buff
if (end <= begin ||
merge_buffer_size < (ulong) (key_length * (end - begin + 1)) ||
init_queue(&queue, (uint) (end - begin), offsetof(BUFFPEK, key), 0,
- buffpek_compare, &compare_context))
+ buffpek_compare, &compare_context, 0, 0))
return 1;
/* we need space for one key when a piece of merge buffer is re-read */
merge_buffer_size-= key_length;
@@ -468,7 +468,7 @@ static bool merge_walk(uchar *merge_buff
*/
top->key+= key_length;
if (--top->mem_count)
- queue_replaced(&queue);
+ queue_replace_top(&queue);
else /* next piece should be read */
{
/* save old_key not to overwrite it in read_to_buffer */
@@ -478,14 +478,14 @@ static bool merge_walk(uchar *merge_buff
if (bytes_read == (uint) (-1))
goto end;
else if (bytes_read > 0) /* top->key, top->mem_count are reset */
- queue_replaced(&queue); /* in read_to_buffer */
+ queue_replace_top(&queue); /* in read_to_buffer */
else
{
/*
Tree for old 'top' element is empty: remove it from the queue and
give all its memory to the nearest tree.
*/
- queue_remove(&queue, 0);
+ queue_remove_top(&queue);
reuse_freed_buff(&queue, top, key_length);
}
}
=== modified file 'storage/maria/ma_ft_boolean_search.c'
--- a/storage/maria/ma_ft_boolean_search.c 2010-01-06 19:20:16 +0000
+++ b/storage/maria/ma_ft_boolean_search.c 2010-07-16 07:33:01 +0000
@@ -473,14 +473,15 @@ static void _ftb_init_index_search(FT_IN
int i;
FTB_WORD *ftbw;
- if ((ftb->state != READY && ftb->state !=INDEX_DONE) ||
- ftb->keynr == NO_SUCH_KEY)
+ if (ftb->state == UNINITIALIZED || ftb->keynr == NO_SUCH_KEY)
return;
ftb->state=INDEX_SEARCH;
- for (i=ftb->queue.elements; i; i--)
+ for (i= queue_last_element(&ftb->queue);
+ (int) i >= (int) queue_first_element(&ftb->queue);
+ i--)
{
- ftbw=(FTB_WORD *)(ftb->queue.root[i]);
+ ftbw=(FTB_WORD *)(queue_element(&ftb->queue, i));
if (ftbw->flags & FTB_FLAG_TRUNC)
{
@@ -585,7 +586,7 @@ FT_INFO * maria_ft_init_boolean_search(M
sizeof(void *))))
goto err;
reinit_queue(&ftb->queue, ftb->queue.max_elements, 0, 0,
- (int (*)(void*, uchar*, uchar*))FTB_WORD_cmp, 0);
+ (int (*)(void*, uchar*, uchar*))FTB_WORD_cmp, 0, 0, 0);
for (ftbw= ftb->last_word; ftbw; ftbw= ftbw->prev)
queue_insert(&ftb->queue, (uchar *)ftbw);
ftb->list=(FTB_WORD **)alloc_root(&ftb->mem_root,
@@ -828,7 +829,7 @@ int maria_ft_boolean_read_next(FT_INFO *
/* update queue */
_ft2_search(ftb, ftbw, 0);
- queue_replaced(& ftb->queue);
+ queue_replace_top(&ftb->queue);
}
ftbe=ftb->root;
=== modified file 'storage/maria/ma_ft_nlq_search.c'
--- a/storage/maria/ma_ft_nlq_search.c 2009-11-30 13:36:06 +0000
+++ b/storage/maria/ma_ft_nlq_search.c 2010-07-16 07:33:01 +0000
@@ -253,12 +253,12 @@ FT_INFO *maria_ft_init_nlq_search(MARIA_
{
QUEUE best;
init_queue(&best,ft_query_expansion_limit,0,0, (queue_compare) &FT_DOC_cmp,
- 0);
+ 0, 0, 0);
tree_walk(&aio.dtree, (tree_walk_action) &walk_and_push,
&best, left_root_right);
while (best.elements)
{
- my_off_t docid=((FT_DOC *)queue_remove(& best, 0))->dpos;
+ my_off_t docid= ((FT_DOC *)queue_remove_top(&best))->dpos;
if (!(*info->read_record)(info, record, docid))
{
info->update|= HA_STATE_AKTIV;
=== modified file 'storage/maria/ma_sort.c'
--- a/storage/maria/ma_sort.c 2009-11-29 23:08:56 +0000
+++ b/storage/maria/ma_sort.c 2010-07-16 07:33:01 +0000
@@ -933,7 +933,7 @@ merge_buffers(MARIA_SORT_PARAM *info, ui
if (init_queue(&queue,(uint) (Tb-Fb)+1,offsetof(BUFFPEK,key),0,
(int (*)(void*, uchar *,uchar*)) info->key_cmp,
- (void*) info))
+ (void*) info, 0, 0))
DBUG_RETURN(1); /* purecov: inspected */
for (buffpek= Fb ; buffpek <= Tb ; buffpek++)
@@ -982,7 +982,7 @@ merge_buffers(MARIA_SORT_PARAM *info, ui
uchar *base= buffpek->base;
uint max_keys=buffpek->max_keys;
- VOID(queue_remove(&queue,0));
+ VOID(queue_remove_top(&queue));
/* Put room used by buffer to use in other buffer */
for (refpek= (BUFFPEK**) &queue_top(&queue);
@@ -1007,7 +1007,7 @@ merge_buffers(MARIA_SORT_PARAM *info, ui
}
else if (error == -1)
goto err; /* purecov: inspected */
- queue_replaced(&queue); /* Top element has been replaced */
+ queue_replace_top(&queue); /* Top element has been replaced */
}
}
buffpek=(BUFFPEK*) queue_top(&queue);
=== modified file 'storage/maria/maria_pack.c'
--- a/storage/maria/maria_pack.c 2009-02-19 09:01:25 +0000
+++ b/storage/maria/maria_pack.c 2010-07-16 07:33:01 +0000
@@ -590,7 +590,7 @@ static int compress(PACK_MRG_INFO *mrg,c
Create a global priority queue in preparation for making
temporary Huffman trees.
*/
- if (init_queue(&queue,256,0,0,compare_huff_elements,0))
+ if (init_queue(&queue, 256, 0, 0, compare_huff_elements, 0, 0, 0))
goto err;
/*
@@ -1521,7 +1521,7 @@ static int make_huff_tree(HUFF_TREE *huf
if (queue.max_elements < found)
{
delete_queue(&queue);
- if (init_queue(&queue,found,0,0,compare_huff_elements,0))
+ if (init_queue(&queue,found, 0, 0, compare_huff_elements, 0, 0, 0))
return -1;
}
@@ -1625,8 +1625,7 @@ static int make_huff_tree(HUFF_TREE *huf
Make a priority queue from the queue. Construct its index so that we
have a partially ordered tree.
*/
- for (i=found/2 ; i > 0 ; i--)
- _downheap(&queue,i);
+ queue_fix(&queue);
/* The Huffman algorithm. */
bytes_packed=0; bits_packed=0;
@@ -1637,12 +1636,9 @@ static int make_huff_tree(HUFF_TREE *huf
Popping from a priority queue includes a re-ordering of the queue,
to get the next least incidence element to the top.
*/
- a=(HUFF_ELEMENT*) queue_remove(&queue,0);
- /*
- Copy the next least incidence element. The queue implementation
- reserves root[0] for temporary purposes. root[1] is the top.
- */
- b=(HUFF_ELEMENT*) queue.root[1];
+ a=(HUFF_ELEMENT*) queue_remove_top(&queue);
+ /* Copy the next least incidence element */
+ b=(HUFF_ELEMENT*) queue_top(&queue);
/* Get a new element from the element buffer. */
new_huff_el=huff_tree->element_buffer+found+i;
/* The new element gets the sum of the two least incidence elements. */
@@ -1664,8 +1660,8 @@ static int make_huff_tree(HUFF_TREE *huf
Replace the copied top element by the new element and re-order the
queue.
*/
- queue.root[1]=(uchar*) new_huff_el;
- queue_replaced(&queue);
+ queue_top(&queue)= (uchar*) new_huff_el;
+ queue_replace_top(&queue);
}
huff_tree->root=(HUFF_ELEMENT*) queue.root[1];
huff_tree->bytes_packed=bytes_packed+(bits_packed+7)/8;
@@ -1796,8 +1792,7 @@ static my_off_t calc_packed_length(HUFF_
Make a priority queue from the queue. Construct its index so that we
have a partially ordered tree.
*/
- for (i=(found+1)/2 ; i > 0 ; i--)
- _downheap(&queue,i);
+ queue_fix(&queue);
/* The Huffman algorithm. */
for (i=0 ; i < found-1 ; i++)
@@ -1811,12 +1806,9 @@ static my_off_t calc_packed_length(HUFF_
incidence). Popping from a priority queue includes a re-ordering
of the queue, to get the next least incidence element to the top.
*/
- a= (my_off_t*) queue_remove(&queue, 0);
- /*
- Copy the next least incidence element. The queue implementation
- reserves root[0] for temporary purposes. root[1] is the top.
- */
- b= (my_off_t*) queue.root[1];
+ a= (my_off_t*) queue_remove_top(&queue);
+ /* Copy the next least incidence element. */
+ b= (my_off_t*) queue_top(&queue);
/* Create a new element in a local (automatic) buffer. */
new_huff_el= element_buffer + i;
/* The new element gets the sum of the two least incidence elements. */
@@ -1836,8 +1828,8 @@ static my_off_t calc_packed_length(HUFF_
queue. This successively replaces the references to counts by
references to HUFF_ELEMENTs.
*/
- queue.root[1]=(uchar*) new_huff_el;
- queue_replaced(&queue);
+ queue_top(&queue)= (uchar*) new_huff_el;
+ queue_replace_top(&queue);
}
DBUG_RETURN(bytes_packed+(bits_packed+7)/8);
}
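The same rewrite pattern appears in both Huffman builders (here and in
myisampack.c further down): queue_fix() replaces the manual _downheap() loop,
and queue_top()/queue_replace_top() replace the direct queue.root[1] access.
A condensed sketch of that loop, assuming the queue helpers used throughout
this patch and a hypothetical NODE type standing in for HUFF_ELEMENT:

#include <my_global.h>
#include <my_sys.h>
#include <queues.h>

typedef struct st_node
{
  my_off_t count;                       /* incidence; the sort key           */
  struct st_node *left, *right;         /* children in the Huffman tree      */
} NODE;

/*
  'queue' already holds 'found' leaf nodes; element_buffer has room for the
  found-1 internal nodes created by the loop.
*/
static NODE *build_huffman_sketch(QUEUE *queue, NODE *element_buffer,
                                  uint found)
{
  uint i;
  if (!found)
    return NULL;
  queue_fix(queue);                     /* heapify after the bulk inserts    */
  for (i= 0; i < found - 1; i++)
  {
    /* Pop the least frequent node, peek at the next least frequent one. */
    NODE *a= (NODE*) queue_remove_top(queue);
    NODE *b= (NODE*) queue_top(queue);
    NODE *merged= element_buffer + i;
    merged->count= a->count + b->count;
    merged->left=  a;
    merged->right= b;
    /* Overwrite the top in place and restore the heap order. */
    queue_top(queue)= (uchar*) merged;
    queue_replace_top(queue);
  }
  return (NODE*) queue_top(queue);      /* the single survivor is the root   */
}

Overwriting the top in place saves a remove/insert pair, which is why the
patch keeps the queue_top(&queue)= (uchar*) new_huff_el assignment instead of
switching to queue_insert().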
=== modified file 'storage/myisam/ft_boolean_search.c'
--- a/storage/myisam/ft_boolean_search.c 2010-06-01 19:52:20 +0000
+++ b/storage/myisam/ft_boolean_search.c 2010-07-16 07:33:01 +0000
@@ -482,16 +482,18 @@ static int _ft2_search(FTB *ftb, FTB_WOR
static void _ftb_init_index_search(FT_INFO *ftb)
{
- int i;
+ uint i;
FTB_WORD *ftbw;
if (ftb->state == UNINITIALIZED || ftb->keynr == NO_SUCH_KEY)
return;
ftb->state=INDEX_SEARCH;
- for (i=ftb->queue.elements; i; i--)
+ for (i= queue_last_element(&ftb->queue);
+ (int) i >= (int) queue_first_element(&ftb->queue);
+ i--)
{
- ftbw=(FTB_WORD *)(ftb->queue.root[i]);
+ ftbw=(FTB_WORD *)(queue_element(&ftb->queue, i));
if (ftbw->flags & FTB_FLAG_TRUNC)
{
@@ -595,12 +597,12 @@ FT_INFO * ft_init_boolean_search(MI_INFO
sizeof(void *))))
goto err;
reinit_queue(&ftb->queue, ftb->queue.max_elements, 0, 0,
- (int (*)(void*, uchar*, uchar*))FTB_WORD_cmp, 0);
+ (int (*)(void*, uchar*, uchar*))FTB_WORD_cmp, 0, 0, 0);
for (ftbw= ftb->last_word; ftbw; ftbw= ftbw->prev)
queue_insert(&ftb->queue, (uchar *)ftbw);
ftb->list=(FTB_WORD **)alloc_root(&ftb->mem_root,
sizeof(FTB_WORD *)*ftb->queue.elements);
- memcpy(ftb->list, ftb->queue.root+1, sizeof(FTB_WORD *)*ftb->queue.elements);
+ memcpy(ftb->list, &queue_top(&ftb->queue), sizeof(FTB_WORD *)*ftb->queue.elements);
my_qsort2(ftb->list, ftb->queue.elements, sizeof(FTB_WORD *),
(qsort2_cmp)FTB_WORD_cmp_list, (void*) ftb->charset);
if (ftb->queue.elements<2) ftb->with_scan &= ~FTB_FLAG_TRUNC;
@@ -839,7 +841,7 @@ int ft_boolean_read_next(FT_INFO *ftb, c
/* update queue */
_ft2_search(ftb, ftbw, 0);
- queue_replaced(& ftb->queue);
+ queue_replace_top(&ftb->queue);
}
ftbe=ftb->root;
=== modified file 'storage/myisam/ft_nlq_search.c'
--- a/storage/myisam/ft_nlq_search.c 2010-01-27 21:53:08 +0000
+++ b/storage/myisam/ft_nlq_search.c 2010-07-16 07:33:01 +0000
@@ -250,12 +250,12 @@ FT_INFO *ft_init_nlq_search(MI_INFO *inf
{
QUEUE best;
init_queue(&best,ft_query_expansion_limit,0,0, (queue_compare) &FT_DOC_cmp,
- 0);
+ 0, 0, 0);
tree_walk(&aio.dtree, (tree_walk_action) &walk_and_push,
&best, left_root_right);
while (best.elements)
{
- my_off_t docid=((FT_DOC *)queue_remove(& best, 0))->dpos;
+ my_off_t docid= ((FT_DOC *)queue_remove_top(&best))->dpos;
if (!(*info->read_record)(info,docid,record))
{
info->update|= HA_STATE_AKTIV;
=== modified file 'storage/myisam/mi_test_all.sh'
--- a/storage/myisam/mi_test_all.sh 2007-07-28 11:36:20 +0000
+++ b/storage/myisam/mi_test_all.sh 2010-07-16 07:33:01 +0000
@@ -5,6 +5,7 @@
valgrind="valgrind --alignment=8 --leak-check=yes"
silent="-s"
+rm -f test1.TMD
if test -f mi_test1$MACH ; then suffix=$MACH ; else suffix=""; fi
./mi_test1$suffix $silent
=== modified file 'storage/myisam/myisampack.c'
--- a/storage/myisam/myisampack.c 2009-02-19 09:01:25 +0000
+++ b/storage/myisam/myisampack.c 2010-07-16 07:33:01 +0000
@@ -576,7 +576,7 @@ static int compress(PACK_MRG_INFO *mrg,c
Create a global priority queue in preparation for making
temporary Huffman trees.
*/
- if (init_queue(&queue,256,0,0,compare_huff_elements,0))
+ if (init_queue(&queue, 256, 0, 0, compare_huff_elements, 0, 0, 0))
goto err;
/*
@@ -1511,7 +1511,7 @@ static int make_huff_tree(HUFF_TREE *huf
if (queue.max_elements < found)
{
delete_queue(&queue);
- if (init_queue(&queue,found,0,0,compare_huff_elements,0))
+ if (init_queue(&queue,found, 0, 0, compare_huff_elements, 0, 0, 0))
return -1;
}
@@ -1615,8 +1615,7 @@ static int make_huff_tree(HUFF_TREE *huf
Make a priority queue from the queue. Construct its index so that we
have a partially ordered tree.
*/
- for (i=found/2 ; i > 0 ; i--)
- _downheap(&queue,i);
+ queue_fix(&queue);
/* The Huffman algorithm. */
bytes_packed=0; bits_packed=0;
@@ -1627,12 +1626,9 @@ static int make_huff_tree(HUFF_TREE *huf
Popping from a priority queue includes a re-ordering of the queue,
to get the next least incidence element to the top.
*/
- a=(HUFF_ELEMENT*) queue_remove(&queue,0);
- /*
- Copy the next least incidence element. The queue implementation
- reserves root[0] for temporary purposes. root[1] is the top.
- */
- b=(HUFF_ELEMENT*) queue.root[1];
+ a=(HUFF_ELEMENT*) queue_remove_top(&queue);
+ /* Copy the next least incidence element */
+ b=(HUFF_ELEMENT*) queue_top(&queue);
/* Get a new element from the element buffer. */
new_huff_el=huff_tree->element_buffer+found+i;
/* The new element gets the sum of the two least incidence elements. */
@@ -1654,8 +1650,8 @@ static int make_huff_tree(HUFF_TREE *huf
Replace the copied top element by the new element and re-order the
queue.
*/
- queue.root[1]=(uchar*) new_huff_el;
- queue_replaced(&queue);
+ queue_top(&queue)= (uchar*) new_huff_el;
+ queue_replace_top(&queue);
}
huff_tree->root=(HUFF_ELEMENT*) queue.root[1];
huff_tree->bytes_packed=bytes_packed+(bits_packed+7)/8;
@@ -1786,8 +1782,7 @@ static my_off_t calc_packed_length(HUFF_
Make a priority queue from the queue. Construct its index so that we
have a partially ordered tree.
*/
- for (i=(found+1)/2 ; i > 0 ; i--)
- _downheap(&queue,i);
+ queue_fix(&queue);
/* The Huffman algorithm. */
for (i=0 ; i < found-1 ; i++)
@@ -1801,12 +1796,9 @@ static my_off_t calc_packed_length(HUFF_
incidence). Popping from a priority queue includes a re-ordering
of the queue, to get the next least incidence element to the top.
*/
- a= (my_off_t*) queue_remove(&queue, 0);
- /*
- Copy the next least incidence element. The queue implementation
- reserves root[0] for temporary purposes. root[1] is the top.
- */
- b= (my_off_t*) queue.root[1];
+ a= (my_off_t*) queue_remove_top(&queue);
+ /* Copy the next least incidence element. */
+ b= (my_off_t*) queue_top(&queue);
/* Create a new element in a local (automatic) buffer. */
new_huff_el= element_buffer + i;
/* The new element gets the sum of the two least incidence elements. */
@@ -1826,8 +1818,8 @@ static my_off_t calc_packed_length(HUFF_
queue. This successively replaces the references to counts by
references to HUFF_ELEMENTs.
*/
- queue.root[1]=(uchar*) new_huff_el;
- queue_replaced(&queue);
+ queue_top(&queue)= (uchar*) new_huff_el;
+ queue_replace_top(&queue);
}
DBUG_RETURN(bytes_packed+(bits_packed+7)/8);
}
=== modified file 'storage/myisam/sort.c'
--- a/storage/myisam/sort.c 2010-04-28 12:52:24 +0000
+++ b/storage/myisam/sort.c 2010-07-16 07:33:01 +0000
@@ -920,7 +920,7 @@ merge_buffers(MI_SORT_PARAM *info, uint
if (init_queue(&queue,(uint) (Tb-Fb)+1,offsetof(BUFFPEK,key),0,
(int (*)(void*, uchar *,uchar*)) info->key_cmp,
- (void*) info))
+ (void*) info, 0, 0))
DBUG_RETURN(1); /* purecov: inspected */
for (buffpek= Fb ; buffpek <= Tb ; buffpek++)
@@ -969,7 +969,7 @@ merge_buffers(MI_SORT_PARAM *info, uint
uchar *base= buffpek->base;
uint max_keys=buffpek->max_keys;
- VOID(queue_remove(&queue,0));
+ VOID(queue_remove_top(&queue));
/* Put room used by buffer to use in other buffer */
for (refpek= (BUFFPEK**) &queue_top(&queue);
@@ -994,7 +994,7 @@ merge_buffers(MI_SORT_PARAM *info, uint
}
else if (error == -1)
goto err; /* purecov: inspected */
- queue_replaced(&queue); /* Top element has been replaced */
+ queue_replace_top(&queue); /* Top element has been replaced */
}
}
buffpek=(BUFFPEK*) queue_top(&queue);
=== modified file 'storage/myisammrg/myrg_queue.c'
--- a/storage/myisammrg/myrg_queue.c 2007-05-10 09:59:39 +0000
+++ b/storage/myisammrg/myrg_queue.c 2010-07-16 07:33:01 +0000
@@ -52,7 +52,7 @@ int _myrg_init_queue(MYRG_INFO *info,int
if (init_queue(q,info->tables, 0,
(myisam_readnext_vec[search_flag] == SEARCH_SMALLER),
queue_key_cmp,
- info->open_tables->table->s->keyinfo[inx].seg))
+ info->open_tables->table->s->keyinfo[inx].seg, 0, 0))
error=my_errno;
}
else
@@ -60,7 +60,7 @@ int _myrg_init_queue(MYRG_INFO *info,int
if (reinit_queue(q,info->tables, 0,
(myisam_readnext_vec[search_flag] == SEARCH_SMALLER),
queue_key_cmp,
- info->open_tables->table->s->keyinfo[inx].seg))
+ info->open_tables->table->s->keyinfo[inx].seg, 0, 0))
error=my_errno;
}
}
=== modified file 'storage/myisammrg/myrg_rnext.c'
--- a/storage/myisammrg/myrg_rnext.c 2007-05-10 09:59:39 +0000
+++ b/storage/myisammrg/myrg_rnext.c 2010-07-16 07:33:01 +0000
@@ -32,7 +32,7 @@ int myrg_rnext(MYRG_INFO *info, uchar *b
{
if (err == HA_ERR_END_OF_FILE)
{
- queue_remove(&(info->by_key),0);
+ queue_remove_top(&(info->by_key));
if (!info->by_key.elements)
return HA_ERR_END_OF_FILE;
}
@@ -43,7 +43,7 @@ int myrg_rnext(MYRG_INFO *info, uchar *b
{
/* Found here, adding to queue */
queue_top(&(info->by_key))=(uchar *)(info->current_table);
- queue_replaced(&(info->by_key));
+ queue_replace_top(&(info->by_key));
}
/* now, mymerge's read_next is as simple as one queue_top */
=== modified file 'storage/myisammrg/myrg_rnext_same.c'
--- a/storage/myisammrg/myrg_rnext_same.c 2007-05-10 09:59:39 +0000
+++ b/storage/myisammrg/myrg_rnext_same.c 2010-07-16 07:33:01 +0000
@@ -29,7 +29,7 @@ int myrg_rnext_same(MYRG_INFO *info, uch
{
if (err == HA_ERR_END_OF_FILE)
{
- queue_remove(&(info->by_key),0);
+ queue_remove_top(&(info->by_key));
if (!info->by_key.elements)
return HA_ERR_END_OF_FILE;
}
@@ -40,7 +40,7 @@ int myrg_rnext_same(MYRG_INFO *info, uch
{
/* Found here, adding to queue */
queue_top(&(info->by_key))=(uchar *)(info->current_table);
- queue_replaced(&(info->by_key));
+ queue_replace_top(&(info->by_key));
}
/* now, mymerge's read_next is as simple as one queue_top */
=== modified file 'storage/myisammrg/myrg_rprev.c'
--- a/storage/myisammrg/myrg_rprev.c 2007-05-10 09:59:39 +0000
+++ b/storage/myisammrg/myrg_rprev.c 2010-07-16 07:33:01 +0000
@@ -32,7 +32,7 @@ int myrg_rprev(MYRG_INFO *info, uchar *b
{
if (err == HA_ERR_END_OF_FILE)
{
- queue_remove(&(info->by_key),0);
+ queue_remove_top(&(info->by_key));
if (!info->by_key.elements)
return HA_ERR_END_OF_FILE;
}
@@ -43,7 +43,7 @@ int myrg_rprev(MYRG_INFO *info, uchar *b
{
/* Found here, adding to queue */
queue_top(&(info->by_key))=(uchar *)(info->current_table);
- queue_replaced(&(info->by_key));
+ queue_replace_top(&(info->by_key));
}
/* now, mymerge's read_prev is as simple as one queue_top */
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3/ branch (timour:2805)
by timour@askmonty.org 16 Jul '10
#At file:///home/tsk/mprog/src/5.3/ based on revid:psergey@askmonty.org-20100716090711-5ijpspzyvmoi5mix
2805 timour(a)askmonty.org 2010-07-16
Fixed a problem where the temp table of a materialized subquery
was not cleaned up between PS re-executions. The reason was two-fold:
- a merge with mysql-6.0 missed select_union::cleanup() that should
have cleaned up the temp table, and
- the subclass of select_union used by materialization didn't call
the base class cleanup() method.
modified:
mysql-test/r/subselect_mat.result
mysql-test/t/subselect_mat.test
sql/sql_class.cc
sql/sql_class.h
sql/sql_union.cc
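Put differently, the fix restores the usual "reset your own state, then let
the base class do its part" shape of cleanup(). A deliberately simplified
sketch of that shape follows (not server code: the class names are invented
and a std::vector stands in for the materialized temp table):

#include <cstddef>
#include <vector>

struct select_union_like
{
  std::vector<int> tmp_table_rows;          /* stand-in for the temp table   */
  virtual ~select_union_like() {}
  /* Empties the "temp table" between executions of a prepared statement.   */
  virtual void cleanup() { tmp_table_rows.clear(); }
};

struct materialize_with_stats_like : public select_union_like
{
  std::size_t count_rows;                   /* per-execution statistics      */
  materialize_with_stats_like() : count_rows(0) {}
  void reset() { count_rows= 0; }
  /*
    Before the fix the override ended after reset(), so rows written by the
    previous execution survived into the next one.
  */
  virtual void cleanup()
  {
    reset();
    select_union_like::cleanup();           /* the call that was missing     */
  }
};

With the base call in place, re-executing the prepared statement in the test
case below sees the updated t1 data instead of the stale materialized rows.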
=== modified file 'mysql-test/r/subselect_mat.result'
--- a/mysql-test/r/subselect_mat.result 2010-06-26 10:05:41 +0000
+++ b/mysql-test/r/subselect_mat.result 2010-07-16 11:02:15 +0000
@@ -1246,3 +1246,29 @@ i
4
set session optimizer_switch=@save_optimizer_switch;
drop table t1, t2, t3;
+create table t0 (a int);
+insert into t0 values (0),(1),(2);
+create table t1 (a int);
+insert into t1 values (0),(1),(2);
+explain select a, a in (select a from t1) from t0;
+id select_type table type possible_keys key key_len ref rows Extra
+1 PRIMARY t0 ALL NULL NULL NULL NULL 3
+2 SUBQUERY t1 ALL NULL NULL NULL NULL 3
+select a, a in (select a from t1) from t0;
+a a in (select a from t1)
+0 1
+1 1
+2 1
+prepare s from 'select a, a in (select a from t1) from t0';
+execute s;
+a a in (select a from t1)
+0 1
+1 1
+2 1
+update t1 set a=123;
+execute s;
+a a in (select a from t1)
+0 0
+1 0
+2 0
+drop table t0, t1;
=== modified file 'mysql-test/t/subselect_mat.test'
--- a/mysql-test/t/subselect_mat.test 2010-03-13 20:04:52 +0000
+++ b/mysql-test/t/subselect_mat.test 2010-07-16 11:02:15 +0000
@@ -905,3 +905,19 @@ select * from t1 where t1.i in (select t
set session optimizer_switch=@save_optimizer_switch;
drop table t1, t2, t3;
+#
+# Test that the contents of the temp table of a materialized subquery is
+# cleaned up between PS re-executions.
+#
+
+create table t0 (a int);
+insert into t0 values (0),(1),(2);
+create table t1 (a int);
+insert into t1 values (0),(1),(2);
+explain select a, a in (select a from t1) from t0;
+select a, a in (select a from t1) from t0;
+prepare s from 'select a, a in (select a from t1) from t0';
+execute s;
+update t1 set a=123;
+execute s;
+drop table t0, t1;
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_class.cc 2010-07-16 11:02:15 +0000
@@ -2994,14 +2994,28 @@ create_result_table(THD *thd_arg, List<I
if (!stat)
return TRUE;
- cleanup();
-
+ reset();
table->file->extra(HA_EXTRA_WRITE_CACHE);
table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
return FALSE;
}
+void select_materialize_with_stats::reset()
+{
+ memset(col_stat, 0, table->s->fields * sizeof(Column_statistics));
+ max_nulls_in_row= 0;
+ count_rows= 0;
+}
+
+
+void select_materialize_with_stats::cleanup()
+{
+ reset();
+ select_union::cleanup();
+}
+
+
/**
Override select_union::send_data to analyze each row for NULLs and to
update null_statistics before sending data to the client.
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-07-10 10:37:30 +0000
+++ b/sql/sql_class.h 2010-07-16 11:02:15 +0000
@@ -2905,7 +2905,7 @@ public:
bool send_data(List<Item> &items);
bool send_eof();
bool flush();
-
+ void cleanup();
virtual bool create_result_table(THD *thd, List<Item> *column_types,
bool is_distinct, ulonglong options,
const char *alias, bool bit_fields_as_long);
@@ -2968,6 +2968,9 @@ protected:
*/
ha_rows count_rows;
+protected:
+ void reset();
+
public:
select_materialize_with_stats() {}
virtual bool create_result_table(THD *thd, List<Item> *column_types,
@@ -2975,12 +2978,7 @@ public:
const char *alias, bool bit_fields_as_long);
bool init_result_table(ulonglong select_options);
bool send_data(List<Item> &items);
- void cleanup()
- {
- memset(col_stat, 0, table->s->fields * sizeof(Column_statistics));
- max_nulls_in_row= 0;
- count_rows= 0;
- }
+ void cleanup();
ha_rows get_null_count_of_col(uint idx)
{
DBUG_ASSERT(idx < table->s->fields);
=== modified file 'sql/sql_union.cc'
--- a/sql/sql_union.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_union.cc 2010-07-16 11:02:15 +0000
@@ -136,6 +136,22 @@ select_union::create_result_table(THD *t
}
+/**
+ Reset and empty the temporary table that stores the materialized query result.
+
+ @note The cleanup performed here is exactly the same as for the two temp
+ tables of JOIN - exec_tmp_table_[1 | 2].
+*/
+
+void select_union::cleanup()
+{
+ table->file->extra(HA_EXTRA_RESET_STATE);
+ table->file->ha_delete_all_rows();
+ free_io_cache(table);
+ filesort_free_buffers(table,0);
+}
+
+
/*
initialization procedures before fake_select_lex preparation()
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/5.3-mwl89/ branch (timour:2801)
by timour@askmonty.org 16 Jul '10
#At file:///home/tsk/mprog/src/5.3-mwl89/ based on revid:sanja@askmonty.org-20100710103730-ayy6a61pdibspf4o
2801 timour(a)askmonty.org 2010-07-16
MWL#89: Cost-based choice between Materialization and IN->EXISTS transformation
1. Changed the lazy optimization for subqueries that can be
materialized into bottom-up optimization during the optimization of
the main query.
The main change is implemented by the method
Item_in_subselect::setup_engine.
All other changes were required to correct problems resulting from
changing the order of optimization. Most of these problems followed
the same pattern - there are some shared structures between a
subquery and its parent query. Depending on which one is optimized
first (parent or child query), these shared strucutres may get
different values, thus resulting in an inconsistent query plan.
2. Changed the code-generation for subquery materialization to be
performed in runtime memory for each (re)execution, instead of in
statement memory (once per prepared statement).
- Item_in_subselect::setup_engine() no longer creates materialization
related objects in statement memory.
- Merged subselect_hash_sj_engine::init_permanent and
subselect_hash_sj_engine::init_runtime into
subselect_hash_sj_engine::init, which is called for each
(re)execution.
- Fixed deletion of the temp table accordingly.
@ mysql-test/r/subselect_mat.result
Adjusted the EXPLAIN output, which changed because subqueries are now optimized earlier.
modified:
mysql-test/r/subselect_mat.result
sql/item_subselect.cc
sql/item_subselect.h
sql/sql_class.cc
sql/sql_class.h
sql/sql_select.cc
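Reduced to its skeleton, the new bottom-up flow looks roughly as follows.
The stub functions are placeholders, not real server calls; the actual logic
is in Item_in_subselect::setup_engine() and subselect_hash_sj_engine::init()
in the diff below.

#include <cstdio>

/*
  Stubs standing in for the real optimizer calls; returning true means
  failure, matching the server's error convention.
*/
static bool optimize_subquery_join()            { return false; }
static bool build_materialization_engine()      { return false; }
static bool apply_in_to_exists_transformation() { return false; }

/*
  Called while the parent query is being optimized (bottom-up), and repeated
  for every (re)execution: nothing materialization-related is kept in
  statement memory any more.
*/
static bool setup_engine_sketch()
{
  if (optimize_subquery_join())         /* optimize the child query first    */
    return true;

  if (build_materialization_engine())   /* temp table + lookup strategy      */
  {
    /*
      Materialization is not applicable: fall back to the IN=>EXISTS
      transformation and force re-optimization of the subquery with the
      injected WHERE/HAVING conditions.
    */
    return apply_in_to_exists_transformation();
  }
  return false;                         /* materialization engine installed  */
}

int main()
{
  std::printf("setup %s\n", setup_engine_sketch() ? "failed" : "succeeded");
  return 0;
}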
=== modified file 'mysql-test/r/subselect_mat.result'
--- a/mysql-test/r/subselect_mat.result 2010-06-26 10:05:41 +0000
+++ b/mysql-test/r/subselect_mat.result 2010-07-16 10:52:02 +0000
@@ -1139,7 +1139,7 @@ insert into t1 values (5);
explain select min(a1) from t1 where 7 in (select b1 from t2 group by b1);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL Select tables optimized away
-2 SUBQUERY t2 system NULL NULL NULL NULL 0 const row not found
+2 SUBQUERY NULL NULL NULL NULL NULL NULL NULL no matching row in const table
select min(a1) from t1 where 7 in (select b1 from t2 group by b1);
min(a1)
set @@optimizer_switch='default,materialization=off';
@@ -1153,7 +1153,7 @@ set @@optimizer_switch='default,semijoin
explain select min(a1) from t1 where 7 in (select b1 from t2);
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL Select tables optimized away
-2 SUBQUERY t2 system NULL NULL NULL NULL 0 const row not found
+2 SUBQUERY NULL NULL NULL NULL NULL NULL NULL no matching row in const table
select min(a1) from t1 where 7 in (select b1 from t2);
min(a1)
set @@optimizer_switch='default,materialization=off';
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-07-10 10:37:30 +0000
+++ b/sql/item_subselect.cc 2010-07-16 10:52:02 +0000
@@ -166,6 +166,7 @@ void Item_in_subselect::cleanup()
Item_subselect::~Item_subselect()
{
delete engine;
+ engine= NULL;
}
Item_subselect::trans_res
@@ -2220,73 +2221,73 @@ void Item_in_subselect::update_used_tabl
bool Item_in_subselect::setup_engine()
{
- subselect_hash_sj_engine *new_engine= NULL;
- bool res= FALSE;
+ subselect_hash_sj_engine *mat_engine= NULL;
+ subselect_single_select_engine *select_engine;
DBUG_ENTER("Item_in_subselect::setup_engine");
- if (engine->engine_type() == subselect_engine::SINGLE_SELECT_ENGINE)
- {
- /* Create/initialize objects in permanent memory. */
- subselect_single_select_engine *old_engine;
- Query_arena *arena= thd->stmt_arena, backup;
- old_engine= (subselect_single_select_engine*) engine;
+ SELECT_LEX *save_select= thd->lex->current_select;
+ thd->lex->current_select= get_select_lex();
+ int res= thd->lex->current_select->join->optimize();
+ thd->lex->current_select= save_select;
+ if (res)
+ DBUG_RETURN(TRUE);
- if (arena->is_conventional())
- arena= 0;
- else
- thd->set_n_backup_active_arena(arena, &backup);
+ /*
+ The select_engine (that executes transformed IN=>EXISTS subselects) is
+ pre-created at parse time, and is stored in statment memory (preserved
+ across PS executions).
+ */
+ DBUG_ASSERT(engine->engine_type() == subselect_engine::SINGLE_SELECT_ENGINE);
+ select_engine= (subselect_single_select_engine*) engine;
- if (!(new_engine= new subselect_hash_sj_engine(thd, this,
- old_engine)) ||
- new_engine->init_permanent(unit->get_unit_column_types()))
- {
- Item_subselect::trans_res trans_res;
- /*
- If for some reason we cannot use materialization for this IN predicate,
- delete all materialization-related objects, and apply the IN=>EXISTS
- transformation.
- */
- delete new_engine;
- new_engine= NULL;
- exec_method= NOT_TRANSFORMED;
- if (left_expr->cols() == 1)
- trans_res= single_value_in_to_exists_transformer(old_engine->join,
- &eq_creator);
- else
- trans_res= row_value_in_to_exists_transformer(old_engine->join);
- res= (trans_res != Item_subselect::RES_OK);
- }
- if (new_engine)
- engine= new_engine;
+ /* Create/initialize execution objects. */
+ if (!(mat_engine= new subselect_hash_sj_engine(thd, this, select_engine)))
+ DBUG_RETURN(TRUE);
- if (arena)
- thd->restore_active_arena(arena, &backup);
- }
- else
+ if (mat_engine->init(&select_engine->join->fields_list))
{
- DBUG_ASSERT(engine->engine_type() == subselect_engine::HASH_SJ_ENGINE);
- new_engine= (subselect_hash_sj_engine*) engine;
- }
+ Item_subselect::trans_res trans_res;
+ /*
+ If for some reason we cannot use materialization for this IN predicate,
+ delete all materialization-related objects, and apply the IN=>EXISTS
+ transformation.
+ */
+ delete mat_engine;
+ mat_engine= NULL;
+ exec_method= NOT_TRANSFORMED;
+
+ if (left_expr->cols() == 1)
+ trans_res= single_value_in_to_exists_transformer(select_engine->join,
+ &eq_creator);
+ else
+ trans_res= row_value_in_to_exists_transformer(select_engine->join);
- /* Initilizations done in runtime memory, repeated for each execution. */
- if (new_engine)
- {
/*
- Reset the LIMIT 1 set in Item_exists_subselect::fix_length_and_dec.
- TODO:
- Currently we set the subquery LIMIT to infinity, and this is correct
- because we forbid at parse time LIMIT inside IN subqueries (see
- Item_in_subselect::test_limit). However, once we allow this, here
- we should set the correct limit if given in the query.
+ The IN=>EXISTS transformation above injects new predicates into the
+ WHERE and HAVING clauses. Since the subquery was already optimized,
+ below we force its reoptimization with the new injected conditions
+ by the first call to subselect_single_select_engine::exec().
+ This is the only case of lazy subquery optimization in the server.
*/
- unit->global_parameters->select_limit= NULL;
- if ((res= new_engine->init_runtime()))
- DBUG_RETURN(res);
+ DBUG_ASSERT(select_engine->join->optimized);
+ select_engine->join->optimized= false;
+ DBUG_RETURN(trans_res != Item_subselect::RES_OK);
}
- DBUG_RETURN(res);
+ /*
+ Reset the "LIMIT 1" set in Item_exists_subselect::fix_length_and_dec.
+ TODO:
+ Currently we set the subquery LIMIT to infinity, and this is correct
+ because we forbid at parse time LIMIT inside IN subqueries (see
+ Item_in_subselect::test_limit). However, once we allow this, here
+ we should set the correct limit if given in the query.
+ */
+ unit->global_parameters->select_limit= NULL;
+
+ engine= mat_engine;
+ DBUG_RETURN(FALSE);
}
@@ -3787,13 +3788,14 @@ bitmap_init_memroot(MY_BITMAP *map, uint
@retval FALSE otherwise
*/
-bool subselect_hash_sj_engine::init_permanent(List<Item> *tmp_columns)
+bool subselect_hash_sj_engine::init(List<Item> *tmp_columns)
{
+ select_union *result_sink;
/* Options to create_tmp_table. */
ulonglong tmp_create_options= thd->options | TMP_TABLE_ALL_COLUMNS;
/* | TMP_TABLE_FORCE_MYISAM; TIMOUR: force MYISAM */
- DBUG_ENTER("subselect_hash_sj_engine::init_permanent");
+ DBUG_ENTER("subselect_hash_sj_engine::init");
if (bitmap_init_memroot(&non_null_key_parts, tmp_columns->elements,
thd->mem_root) ||
@@ -3822,15 +3824,16 @@ bool subselect_hash_sj_engine::init_perm
DBUG_RETURN(TRUE);
}
*/
- if (!(result= new select_materialize_with_stats))
+ if (!(result_sink= new select_materialize_with_stats))
DBUG_RETURN(TRUE);
-
- if (((select_union*) result)->create_result_table(
- thd, tmp_columns, TRUE, tmp_create_options,
- "materialized subselect", TRUE))
+ result_sink->get_tmp_table_param()->materialized_subquery= true;
+ if (result_sink->create_result_table(thd, tmp_columns, TRUE,
+ tmp_create_options,
+ "materialized subselect", TRUE))
DBUG_RETURN(TRUE);
- tmp_table= ((select_union*) result)->table;
+ tmp_table= result_sink->table;
+ result= result_sink;
/*
If the subquery has blobs, or the total key lenght is bigger than
@@ -3867,6 +3870,17 @@ bool subselect_hash_sj_engine::init_perm
!(lookup_engine= make_unique_engine()))
DBUG_RETURN(TRUE);
+ /*
+ Repeat name resolution for 'cond' since cond is not part of any
+ clause of the query, and it is not 'fixed' during JOIN::prepare.
+ */
+ if (semi_join_conds && !semi_join_conds->fixed &&
+ semi_join_conds->fix_fields(thd, (Item**)&semi_join_conds))
+ DBUG_RETURN(TRUE);
+ /* Let our engine reuse this query plan for materialization. */
+ materialize_join= materialize_engine->join;
+ materialize_join->change_result(result);
+
DBUG_RETURN(FALSE);
}
@@ -3957,8 +3971,6 @@ subselect_hash_sj_engine::make_unique_en
Item_iterator_row it(item_in->left_expr);
/* The only index on the temporary table. */
KEY *tmp_key= tmp_table->key_info;
- /* Number of keyparts in tmp_key. */
- uint tmp_key_parts= tmp_key->key_parts;
JOIN_TAB *tab;
DBUG_ENTER("subselect_hash_sj_engine::make_unique_engine");
@@ -3981,41 +3993,22 @@ subselect_hash_sj_engine::make_unique_en
}
-/**
- Initialize members of the engine that need to be re-initilized at each
- execution.
+subselect_hash_sj_engine::~subselect_hash_sj_engine()
+{
+ delete lookup_engine;
+ delete result;
+ if (tmp_table)
+ free_tmp_table(thd, tmp_table);
+}
- @retval TRUE if a memory allocation error occurred
- @retval FALSE if success
-*/
-bool subselect_hash_sj_engine::init_runtime()
+int subselect_hash_sj_engine::prepare()
{
/*
Create and optimize the JOIN that will be used to materialize
the subquery if not yet created.
*/
- materialize_engine->prepare();
- /*
- Repeat name resolution for 'cond' since cond is not part of any
- clause of the query, and it is not 'fixed' during JOIN::prepare.
- */
- if (semi_join_conds && !semi_join_conds->fixed &&
- semi_join_conds->fix_fields(thd, (Item**)&semi_join_conds))
- return TRUE;
- /* Let our engine reuse this query plan for materialization. */
- materialize_join= materialize_engine->join;
- materialize_join->change_result(result);
- return FALSE;
-}
-
-
-subselect_hash_sj_engine::~subselect_hash_sj_engine()
-{
- delete lookup_engine;
- delete result;
- if (tmp_table)
- free_tmp_table(thd, tmp_table);
+ return materialize_engine->prepare();
}
@@ -4036,6 +4029,12 @@ void subselect_hash_sj_engine::cleanup()
count_null_only_columns= 0;
strategy= UNDEFINED;
materialize_engine->cleanup();
+ /*
+ Restore the original Item_in_subselect engine. This engine is created once
+ at parse time and stored across executions, while all other materialization
+ related engines are created and chosen for each execution.
+ */
+ ((Item_in_subselect *) item)->engine= materialize_engine;
if (lookup_engine_type == TABLE_SCAN_ENGINE ||
lookup_engine_type == ROWID_MERGE_ENGINE)
{
@@ -4052,6 +4051,9 @@ void subselect_hash_sj_engine::cleanup()
DBUG_ASSERT(lookup_engine->engine_type() == UNIQUESUBQUERY_ENGINE);
lookup_engine->cleanup();
result->cleanup(); /* Resets the temp table as well. */
+ DBUG_ASSERT(tmp_table);
+ free_tmp_table(thd, tmp_table);
+ tmp_table= NULL;
}
@@ -4080,9 +4082,8 @@ int subselect_hash_sj_engine::exec()
the subquery predicate.
*/
thd->lex->current_select= materialize_engine->select_lex;
- if ((res= materialize_join->optimize()))
- goto err; /* purecov: inspected */
- DBUG_ASSERT(!is_materialized); /* We should materialize only once. */
+ /* The subquery should be optimized, and materialized only once. */
+ DBUG_ASSERT(materialize_join->optimized && !is_materialized);
materialize_join->exec();
if ((res= test(materialize_join->error || thd->is_fatal_error)))
goto err;
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-07-10 10:37:30 +0000
+++ b/sql/item_subselect.h 2010-07-16 10:52:02 +0000
@@ -817,10 +817,9 @@ public:
}
~subselect_hash_sj_engine();
- bool init_permanent(List<Item> *tmp_columns);
- bool init_runtime();
+ bool init(List<Item> *tmp_columns);
void cleanup();
- int prepare() { return 0; } /* Override virtual function in base class. */
+ int prepare();
int exec();
virtual void print(String *str, enum_query_type query_type);
uint cols()
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_class.cc 2010-07-16 10:52:02 +0000
@@ -3052,6 +3052,7 @@ void TMP_TABLE_PARAM::init()
table_charset= 0;
precomputed_group_by= 0;
bit_fields_as_long= 0;
+ materialized_subquery= 0;
skip_create_table= 0;
DBUG_VOID_RETURN;
}
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-07-10 10:37:30 +0000
+++ b/sql/sql_class.h 2010-07-16 10:52:02 +0000
@@ -2852,6 +2852,8 @@ public:
uint convert_blob_length;
CHARSET_INFO *table_charset;
bool schema_table;
+ /* TRUE if the temp table is created for subquery materialization. */
+ bool materialized_subquery;
/*
True if GROUP BY and its aggregate functions are already computed
by a table access method (e.g. by loose index scan). In this case
@@ -2875,8 +2877,8 @@ public:
TMP_TABLE_PARAM()
:copy_field(0), group_parts(0),
group_length(0), group_null_parts(0), convert_blob_length(0),
- schema_table(0), precomputed_group_by(0), force_copy_fields(0),
- bit_fields_as_long(0), skip_create_table(0)
+ schema_table(0), materialized_subquery(0), precomputed_group_by(0),
+ force_copy_fields(0), bit_fields_as_long(0), skip_create_table(0)
{}
~TMP_TABLE_PARAM()
{
@@ -2905,6 +2907,7 @@ public:
bool send_data(List<Item> &items);
bool send_eof();
bool flush();
+ TMP_TABLE_PARAM *get_tmp_table_param() { return &tmp_table_param; }
virtual bool create_result_table(THD *thd, List<Item> *column_types,
bool is_distinct, ulonglong options,
@@ -2969,7 +2972,7 @@ protected:
ha_rows count_rows;
public:
- select_materialize_with_stats() {}
+ select_materialize_with_stats() { tmp_table_param.init(); }
virtual bool create_result_table(THD *thd, List<Item> *column_types,
bool is_distinct, ulonglong options,
const char *alias, bool bit_fields_as_long);
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_select.cc 2010-07-16 10:52:02 +0000
@@ -2586,14 +2586,13 @@ err:
Setup for execution all subqueries of a query, for which the optimizer
chose hash semi-join.
- @details Iterate over all subqueries of the query, and if they are under an
- IN predicate, and the optimizer chose to compute it via hash semi-join:
- - try to initialize all data structures needed for the materialized execution
- of the IN predicate,
- - if this fails, then perform the IN=>EXISTS transformation which was
- previously blocked during JOIN::prepare.
-
- This method is part of the "code generation" query processing phase.
+ @details Iterate over all immediate child subqueries of the query, and if
+ they are under an IN predicate, and the optimizer chose to compute it via
+ materialization:
+ - optimize each subquery,
+ - choose an optimial execution strategy for the IN predicate - either
+ materialization, or an IN=>EXISTS transformation with an approriate
+ engine.
This phase must be called after substitute_for_best_equal_field() because
that function may replace items with other items from a multiple equality,
@@ -7925,7 +7924,7 @@ bool TABLE_REF::tmp_table_index_lookup_i
use that information instead.
*/
cur_ref_buff + null_count,
- null_count ? key_buff : 0,
+ null_count ? cur_ref_buff : 0,
cur_key_part->length, items[i], value);
cur_ref_buff+= cur_key_part->store_length;
}
@@ -11408,10 +11407,30 @@ create_tmp_table(THD *thd,TMP_TABLE_PARA
{
if (thd->is_fatal_error)
goto err; // Got OOM
- continue; // Some kindf of const item
+ continue; // Some kind of const item
}
if (type == Item::SUM_FUNC_ITEM)
- ((Item_sum *) item)->result_field= new_field;
+ {
+ Item_sum *agg_item= (Item_sum *) item;
+ /*
+ Update the result field only if it has never been set, or if the
+ created temporary table is not to be used for subquery
+ materialization.
+
+ The reason is that for subqueries that require materialization as part
+ of their plan, we create the 'external' temporary table needed for IN
+ execution, after the 'internal' temporary table needed for grouping.
+ Since both the external and the internal temporary tables are created
+ for the same list of SELECT fields of the subquery, setting
+ 'result_field' for each invocation of create_tmp_table overrides the
+ previous value of 'result_field'.
+
+ The condition below prevents the creation of the external temp table
+ to override the 'result_field' that was set for the internal temp table.
+ */
+ if (!agg_item->result_field || !param->materialized_subquery)
+ agg_item->result_field= new_field;
+ }
tmp_from_field++;
reclength+=new_field->pack_length();
if (!(new_field->flags & NOT_NULL_FLAG))
@@ -19240,6 +19259,8 @@ bool JOIN::change_result(select_result *
{
DBUG_ENTER("JOIN::change_result");
result= res;
+ if (tmp_join)
+ tmp_join->result= res;
if (!procedure && (result->prepare(fields_list, select_lex->master_unit()) ||
result->prepare2()))
{
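
For readers following the create_tmp_table() hunk above, here is a minimal standalone sketch of the guard it adds around result_field; the types and function below are illustrative stand-ins, not the actual server classes:

#include <cstdio>

// Illustrative stand-ins for the server types involved (not the real classes).
struct Field   { const char *owner; };
struct ItemSum { Field *result_field = nullptr; };
struct TmpTableParam { bool materialized_subquery = false; };

// Sketch of the guard: only bind the aggregate's result_field to the new
// table's field if it was never set, or if this temporary table is not the
// 'external' one built for subquery materialization.
void bind_result_field(ItemSum *agg, Field *new_field, const TmpTableParam *param)
{
  if (!agg->result_field || !param->materialized_subquery)
    agg->result_field = new_field;
}

int main()
{
  ItemSum sum_func;
  Field internal_field{"internal grouping tmp table"};
  Field external_field{"external materialization tmp table"};

  TmpTableParam internal_param;                         // ordinary tmp table
  TmpTableParam external_param;
  external_param.materialized_subquery = true;          // IN-materialization table

  bind_result_field(&sum_func, &internal_field, &internal_param);
  bind_result_field(&sum_func, &external_field, &external_param);

  // The second call must not clobber the binding made by the first one.
  std::printf("result_field points at: %s\n", sum_func.result_field->owner);
  return 0;
}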
1
0
[Maria-developers] WL#126 New (by Monty): Sync also maria_control_file when doing a flush tables
by worklog-noreply@askmonty.org 16 Jul '10
by worklog-noreply@askmonty.org 16 Jul '10
16 Jul '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Sync also maria_control_file when doing a flush tables
CREATION DATE..: Fri, 16 Jul 2010, 09:45
SUPERVISOR.....:
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Maria-BackLog
TASK ID........: 126 (http://askmonty.org/worklog/?tid=126)
VERSION........: WorkLog-4.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 8 (hours remain)
ORIG. ESTIMATE.: 8
PROGRESS NOTES:
DESCRIPTION:
Sync also maria_control_file when doing a flush tables
This is to solve the following problem:
- Fill data in a maria table
- Flush tables
- kill mysqld (without shutdown)
Now when one does a maria_check table, one can get a warning like:
maria_chk: error: Found row with transaction id 206985052 when max transaction
id according to maria_control_file is 206984765
which is a bit confusing.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
16 Jul '10
Hi All,
Yes, PBXT uses a lot of file handles: generally 3 per table. It opens
all tables before recovery. Unfortunately, I have not worked on
optimizing this behavior.
Although the only reason for this is to resolve possible FK/PK
relationships. So I think this should be considered a PBXT bug, and
the problem should be fixed.
Anyway this will certainly be the problem in this case. As Kristian
says, select() is severely limited in that it only works with the low
numbered file handles.
To get the server up and running again you can just delete the tables
manually (the table files: .frm, .xtd, .xtr, .xti), when MySQL is not
running. PBXT will complain on startup, but should recover anyway.
Best regards,
Paul
On Jul 15, 2010, at 9:48 AM, Michael Widenius wrote:
>
> Hi!
>
>>>>>> "Time" == Time Less <timelessness(a)gmail.com> writes:
>
>>> I think it is very likely that you are hitting this bug:
>>>
>>> http://bugs.mysql.com/bug.php?id=48929
>>>
>
> Time> Ah, yes, could be. Though my strace was different than this
> bug shows.
>
>
>>> The problem is that MySQL/MariaDB is using select() to accept new
>>> connections. But select() has a hard-coded limit of 1024 on the
>>> max number
>>> of
>>> open files it can support. It seems PBXT uses an open file
>>> descriptor per
>>> table,
>
>
> Time> Not two per table? It has it looks like many log files, then
> also a data and
> Time> index file per table. On my system where I'm trying to use
> 1,000 tables, I
> Time> expect about 2,000+[mumble] file handles.
>
>>> Maybe MariaDB should backport the fix, it is actually a buffer
>>> overflow
>>> (though it is hard to see how it could be exploitable), but
>>> perhaps more
>>> relevant it is a rather nasty state for the server to get into,
>>> and not
>>> really
>>> clear how to get it out of it again :-(.
>>>
>
> Time> You can't get out of the state again. The server won't accept
> connections,
> Time> so you can't drop any tables. You just have to wipe the DB and
> start over.
> Time> If PBXT is resilient to its tables disappearing between server
> restarts, you
> Time> could rm files from the data directory.
>
> I didn't know that PBXT would open up all files at startup. If that's
> the case, we should switch to use poll in MariaDB 5.3 ASAP (and
> provide a patch for those that wants it for MariaDB 5.1).
>
> Paul, can you verify the above is the case for PBXT (ie, that PBXT
> uses one file descriptor per table and don't free these at all?)
>
> Regards,
> Monty
>
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to : maria-developers(a)lists.launchpad.net
> Unsubscribe : https://launchpad.net/~maria-developers
> More help : https://help.launchpad.net/ListHelp
1
0
[Maria-developers] Rev 2801: Fixed an error in the creation of REF access method for materialized in file:///home/tsk/mprog/src/5.3/
by timour@askmonty.org 15 Jul '10
by timour@askmonty.org 15 Jul '10
15 Jul '10
At file:///home/tsk/mprog/src/5.3/
------------------------------------------------------------
revno: 2801
revision-id: timour(a)askmonty.org-20100715135910-y1gvcc3d63sod6xt
parent: sanja(a)askmonty.org-20100710103730-ayy6a61pdibspf4o
committer: timour(a)askmonty.org
branch nick: 5.3
timestamp: Thu 2010-07-15 16:59:10 +0300
message:
Fixed an error in the creation of REF access method for materialized
subquery execution, where the REF buffer format was mistaken to
be in record format instead of key format. The error was that the null
byte for all fields of the record was at the front of the buffer,
and not before each field's data.
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-07-10 10:37:30 +0000
+++ b/sql/item_subselect.cc 2010-07-15 13:59:10 +0000
@@ -3957,8 +3957,6 @@
Item_iterator_row it(item_in->left_expr);
/* The only index on the temporary table. */
KEY *tmp_key= tmp_table->key_info;
- /* Number of keyparts in tmp_key. */
- uint tmp_key_parts= tmp_key->key_parts;
JOIN_TAB *tab;
DBUG_ENTER("subselect_hash_sj_engine::make_unique_engine");
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-07-10 10:37:30 +0000
+++ b/sql/sql_select.cc 2010-07-15 13:59:10 +0000
@@ -7925,7 +7925,7 @@
use that information instead.
*/
cur_ref_buff + null_count,
- null_count ? key_buff : 0,
+ null_count ? cur_ref_buff : 0,
cur_key_part->length, items[i], value);
cur_ref_buff+= cur_key_part->store_length;
}
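
To picture the difference between record format and key format that this fix is about, here is a small self-contained sketch; the two-field layout and the one-byte-per-flag encoding are simplifications chosen for illustration, not the exact buffer encoding used by the server:

#include <cstdint>
#include <cstring>
#include <cstdio>

// Record format: all null-indicator bits are packed at the front of the
// buffer, followed by the field values back to back.
// Key (ref) format: each key part is preceded by its own null-indicator byte.
int main()
{
  const uint32_t f1 = 42;            // field 1, not NULL
  const bool     f2_is_null = true;  // field 2 is NULL
  const uint32_t f2 = 0;

  // "Record"-style layout: [null bitmap][f1][f2]
  unsigned char rec[1 + 4 + 4] = {0};
  rec[0] = f2_is_null ? 0x02 : 0x00;   // bit 1 = field 2 is NULL
  std::memcpy(rec + 1, &f1, 4);
  std::memcpy(rec + 5, &f2, 4);

  // "Key"-style layout: [f1 null byte][f1][f2 null byte][f2]
  unsigned char key[1 + 4 + 1 + 4] = {0};
  key[0] = 0;                          // f1 not NULL
  std::memcpy(key + 1, &f1, 4);
  key[5] = f2_is_null ? 1 : 0;         // f2's NULL flag sits right before f2
  std::memcpy(key + 6, &f2, 4);

  std::printf("record layout: %zu bytes, key layout: %zu bytes\n",
              sizeof(rec), sizeof(key));
  return 0;
}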
1
0
15 Jul '10
Hi!
>>>>> "Time" == Time Less <timelessness(a)gmail.com> writes:
>> I think it is very likely that you are hitting this bug:
>>
>> http://bugs.mysql.com/bug.php?id=48929
>>
Time> Ah, yes, could be. Though my strace was different than this bug shows.
>> The problem is that MySQL/MariaDB is using select() to accept new
>> connections. But select() has a hard-coded limit of 1024 on the max number
>> of
>> open files it can support. It seems PBXT uses an open file descriptor per
>> table,
Time> Not two per table? It has it looks like many log files, then also a data and
Time> index file per table. On my system where I'm trying to use 1,000 tables, I
Time> expect about 2,000+[mumble] file handles.
>> Maybe MariaDB should backport the fix, it is actually a buffer overflow
>> (though it is hard to see how it could be exploitable), but perhaps more
>> relevant it is a rather nasty state for the server to get into, and not
>> really
>> clear how to get it out of it again :-(.
>>
Time> You can't get out of the state again. The server won't accept connections,
Time> so you can't drop any tables. You just have to wipe the DB and start over.
Time> If PBXT is resilient to its tables disappearing between server restarts, you
Time> could rm files from the data directory.
I didn't know that PBXT would open up all files at startup. If that's
the case, we should switch to use poll in MariaDB 5.3 ASAP (and
provide a patch for those that wants it for MariaDB 5.1).
Paul, can you verify the above is the case for PBXT (ie, that PBXT
uses one file descriptor per table and don't free these at all?)
Regards,
Monty
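
As background on the 1024 limit discussed in this thread, the sketch below (plain POSIX, not the server's listener code) shows why select() stops working once a descriptor reaches FD_SETSIZE while poll() keeps working:

#include <poll.h>
#include <sys/select.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  // In this toy program the descriptor is a low number; in the scenario
  // discussed above it is the listening socket, pushed past 1023 because the
  // engine already holds thousands of open table files.
  int fd = open("/dev/null", O_RDONLY);

  // select() can only handle descriptors below FD_SETSIZE (usually 1024):
  // FD_SET() on a larger fd writes outside the fd_set, which is the failure
  // mode behind the "Error in accept: Bad file descriptor" flood.
  std::printf("FD_SETSIZE = %d, fd = %d, usable with select(): %s\n",
              FD_SETSIZE, fd, fd < FD_SETSIZE ? "yes" : "no");

  // poll() takes an explicit array of descriptors, so any fd value works.
  struct pollfd pfd = { fd, POLLIN, 0 };
  int ready = poll(&pfd, 1, 0);        // 0 ms timeout: just probe readiness
  std::printf("poll() on fd %d returned %d\n", fd, ready);

  close(fd);
  return 0;
}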
2
1
14 Jul '10
Time Less <timelessness(a)gmail.com> writes:
>> The problem is that MySQL/MariaDB is using select() to accept new
>> connections. But select() has a hard-coded limit of 1024 on the max number
>> of
>> open files it can support. It seems PBXT uses an open file descriptor per
>> table,
>
>
> Not two per table? It has it looks like many log files, then also a data and
> index file per table. On my system where I'm trying to use 1,000 tables, I
> expect about 2,000+[mumble] file handles.
Sure could be, I don't know the details, just that it seems to open >1000
files simultaneously.
> You can't get out of the state again. The server won't accept connections,
> so you can't drop any tables. You just have to wipe the DB and start over.
> If PBXT is resilient to its tables disappearing between server restarts, you
> could rm files from the data directory.
One way is to use the --bootstrap parameter to mysqld (after stopping the
server).
This allows to start the server, run a set of commands, and shut down, without
needing to connect. Something like
(echo "DROP TABLE t1;" ; echo "DROP TABLE t2;"; ...) | mysqld --defaults-file=/etc/my.cnf --bootstrap
(I actually tried this).
One way or the other, it's a nasty bug.
- Kristian.
1
0
14 Jul '10
Hi!
I am resending this to maria-developers, as this is a more appropriate
address for PBXT issues.
>>>>> "Time" == Time Less <timelessness(a)gmail.com> writes:
Time> Just thought I'd try out PBXT. I created a database instance with 10
Time> databases, inside each 100 tables. They were MyISAM tables. Then I did an
Time> "alter table <name> engine=pbxt" across all of them. That worked fine. Then
Time> I restarted MySQL, and now it won't come up. Error logs are filled with
Time> thousands of:
Time> # tail /var/log/mysql.err
Time> 100713 12:33:33 [ERROR] Error in accept: Bad file descriptor
Time> 100713 12:33:33 [ERROR] Error in accept: Bad file descriptor
Time> 100713 12:33:33 [ERROR] Error in accept: Bad file descriptor
Time> To the tune of tens (or hundreds?) per second. I've tried restarting mysqld
Time> with *ulimit -n 32768* and with my.cnf setting *open_files_limit = 32768* to
Time> no effect. Does anyone have experience with this? Extensive Google searches
Time> turn up nothing except esoteric OpenBSD problems or such.
Time> MariaDB is MariaDB-server-5.1.44-75.el5.
Sorry, but this is an issue that I haven't heard about before.
However, I am sure that the PBXT team will have some suggestions for
you how to solve this.
Regards,
Monty
2
1
Re: [Maria-developers] [Fwd: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2 branch (igor:2822) Bug#603654]
by Sergey Petrunya 13 Jul '10
by Sergey Petrunya 13 Jul '10
13 Jul '10
Hello Igor,
Ok to push.
On Mon, Jul 12, 2010 at 07:08:58PM -0700, Igor Babaev wrote:
> Please review this patch for the 5.2 tree.
>
> Regards,
> Igor.
>
>
> -------- Original Message --------
> Subject: [Commits] bzr commit into Mariadb 5.2, with Maria 2.0:maria/5.2
> branch (igor:2822) Bug#603654
> Date: Mon, 12 Jul 2010 19:05:37 -0700 (PDT)
> From: Igor Babaev <igor(a)askmonty.org>
> Reply-To: maria-developers(a)lists.launchpad.net
> To: commits(a)mariadb.org
>
> #At lp:maria/5.2 based on
> revid:igor@askmonty.org-20100713012307-rnom77fx57ef900o
>
> 2822 Igor Babaev 2010-07-12
> Fixed bug #603654.
> If a virtual column was used in the ORDER BY clause of a query
> and some of the columns this virtual column was based upon were
> not referred anywhere in the query then the execution of the
> query could cause an assertion failure.
> It happened because in this case the bitmap of the columns used
> for ordering keys was not formed correctly.
> modified:
> mysql-test/suite/vcol/r/vcol_misc.result
> mysql-test/suite/vcol/t/vcol_misc.test
> sql/filesort.cc
>
> === modified file 'mysql-test/suite/vcol/r/vcol_misc.result'
> --- a/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-13 01:23:07 +0000
> +++ b/mysql-test/suite/vcol/r/vcol_misc.result 2010-07-13 02:05:28 +0000
> @@ -35,3 +35,13 @@ a int NOT NULL DEFAULT '0',
> v double AS ((1, a)) VIRTUAL
> );
> ERROR HY000: Expression for computed column cannot return a row
> +CREATE TABLE t1 (
> +a CHAR(255) BINARY NOT NULL DEFAULT 0,
> +b CHAR(255) BINARY NOT NULL DEFAULT 0,
> +v CHAR(255) BINARY AS (CONCAT(a,b)) VIRTUAL );
> +INSERT INTO t1(a,b) VALUES ('4','7'), ('4','6');
> +SELECT 1 AS C FROM t1 ORDER BY v;
> +C
> +1
> +1
> +DROP TABLE t1;
>
> === modified file 'mysql-test/suite/vcol/t/vcol_misc.test'
> --- a/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-13 01:23:07 +0000
> +++ b/mysql-test/suite/vcol/t/vcol_misc.test 2010-07-13 02:05:28 +0000
> @@ -30,6 +30,19 @@ CREATE TABLE t1 (
> v double AS ((1, a)) VIRTUAL
> );
>
> +#
> +# Bug#603654: Virtual column in ORDER BY, no other references of table
> columns
> +#
> +
> +CREATE TABLE t1 (
> + a CHAR(255) BINARY NOT NULL DEFAULT 0,
> + b CHAR(255) BINARY NOT NULL DEFAULT 0,
> + v CHAR(255) BINARY AS (CONCAT(a,b)) VIRTUAL );
> +INSERT INTO t1(a,b) VALUES ('4','7'), ('4','6');
> +SELECT 1 AS C FROM t1 ORDER BY v;
> +
> +DROP TABLE t1;
> +
>
>
>
>
> === modified file 'sql/filesort.cc'
> --- a/sql/filesort.cc 2010-06-01 19:52:20 +0000
> +++ b/sql/filesort.cc 2010-07-13 02:05:28 +0000
> @@ -1009,7 +1009,14 @@ static void register_used_fields(SORTPAR
> if ((field= sort_field->field))
> {
> if (field->table == table)
> - bitmap_set_bit(bitmap, field->field_index);
> + {
> + if (field->vcol_info)
> + {
> + Item *vcol_item= field->vcol_info->expr_item;
> + vcol_item->walk(&Item::register_field_in_read_map, 1, (uchar
> *) 0);
> + }
> + bitmap_set_bit(bitmap, field->field_index);
> + }
> }
> else
> { // Item
>
> _______________________________________________
> commits mailing list
> commits(a)mariadb.org
> https://lists.askmonty.org/cgi-bin/mailman/listinfo/commits
--
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
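
The idea behind the filesort.cc hunk above can be pictured with a small standalone sketch: when a virtual column is used for ordering, the read-set bitmap must also cover the base columns its expression depends on (in the real server this is done by walking the Item tree with Item::register_field_in_read_map). The types below are stand-ins, not the server's Item/Field classes:

#include <bitset>
#include <vector>
#include <cstdio>

struct Column
{
  unsigned index;                   // position in the table
  std::vector<unsigned> base_cols;  // empty for ordinary columns; for a virtual
                                    // column, the columns its expression reads
};

static void register_used_column(const Column &col, std::bitset<64> &read_set)
{
  for (unsigned base : col.base_cols)   // "walk" the expression's fields
    read_set.set(base);
  read_set.set(col.index);
}

int main()
{
  // v = CONCAT(a, b): a virtual column over base columns 0 (a) and 1 (b).
  Column v{2, {0, 1}};
  std::bitset<64> read_set;
  register_used_column(v, read_set);
  // Lowest four bits printed: base columns and the virtual column are all set.
  std::printf("read set: %s\n", read_set.to_string().substr(60).c_str());
  return 0;
}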
1
0
Hi people,
Our cmake scripts weren't able to build outside the source dir. This is
bad for 3 reasons:
- it's supposed to work
- cmake recommends building outside the src dir
- on Windows, cmake can't build 32 and 64 bit in the same solution, so
patches have to be manually copied to test compilation on the other
The attached patch fixes all the issues. I ran a find on the source dir
before and after the build, and they were identical. I didn't
actually check that no existing files were modified, but bzr di claims
this isn't the case.
I didn't port the zip creation script, but that shouldn't be a problem.
src != blddir is more a development thing, and the buildbot slave
producing the zip file will just continue to build in the same dir. The
cpack installer builds just fine.
OK to push to 5.1?
Should I send this patch to the Oracle as well to minimize the drift
between the cmake files?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
1
0
Hi,
I have pushed SphinxSE into MariaDB 5.2. It should be available in the
upcoming MariaDB 5.2.2, which is currently planned within a few weeks.
For now, only the code for the storage engine is pushed, not for the
mysql-test-run.pl stuff. Sergei Golubchik wanted to modify mysql-test-run.pl
first so that engine-specific code can be added in a more modular fashion (he
will do it when he returns from vacation).
For now, the engine is by default built as a loadable .so plugin. This is the
general policy for MariaDB for new plugins. I think the idea is that engines
that prove to be stable and well supported by their maintainer can get
promoted to build statically by default (we are still flexing out the new
policy). Note that the engine will still be available by default, the user
will just need to do a one-time INSTALL PLUGIN command. Note also that
SphinxSE can still be built statically with ./configure --with-plugin-sphinx,
only the default is dynamic.
Daniel, you should probably look into documenting the SphinxSE for MariaDB
5.2.2. Andrew kindly gave us permission to use what we need from the Sphinx
documentation:
http://sphinxsearch.com/docs/current.html#sphinxse
Andrew suggested including also a link to the Sphinx webpage in the MariaDB
documentation for SphinxSE, which is of course a very good idea.
Also you should document how to do the INSTALL PLUGIN command (it's similar to
for OQGraph). Note that the sphinx plugin will be included in eg. deb packages
and so on and installed by default, just the INSTALL PLUGIN command is needed
to enable it.
Andrew, thanks for all your work with this! It is very good to have SphinxSE
included, there have been several requests for this.
We (mainly Sergei) made some changes to the code during review. I attach a
patch for all the changes since your original import, in case you are
interested. Note that I re-added the code when I merged to 5.2 to get a clean
history, so future work on SphinxSE in MariaDB should be based on
lp:maria/5.2, not our old sphinx bzr trees.
- Kristian.
1
0
08 Jul '10
The attached patch reduces the warnings in a 32 bit build with Visual
Studio 2008 to a set in the flex/bison generated code. I'll handle those
later.
I'll add comments on some of the changes here.
The libmysqld.def file: The description line isn't supported in the more
recent compilers.
In fsp0fsp.c, there is an explicit cast to ullint. In this case, the
code does what's intended, and the compiler warning is one of the "are
you sure this is right" warnings. C4334 gives a warning if you make a
1u << 10 and store that in a 64 bit variable, because you could have
meant 1i64 << 10.
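
For illustration, the two forms in a trivial standalone example (not taken from the MariaDB sources):

#include <cstdio>

int main()
{
  unsigned long long a = 1u << 10;    // MSVC C4334: result of a 32-bit shift
                                      // implicitly converted to 64 bits --
                                      // "was a 64-bit shift intended?"
  unsigned long long b = 1ULL << 10;  // shift performed in 64 bits, no warning
  std::printf("%llu %llu\n", a, b);
  return 0;
}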
The change in i_s.cc is the one I'm most worried about. It used to store
the unsigned long long in a double. The change I did can be wrong. But
even if it's not, I'm worried what happens when this runs on an existing
set of tables.
We had a very long discussion about the cast in row0sel.c. The
conclusion was that
* auto_increment on doubles or floats is a very odd case
* there is a compiler bug in Visual Studio for very large values (above
around 2^53 it converts doubles wrong to uint64
I'd like to separate this discussion from this patch, though, and submit
a bug report on the innodb code on mysql.
I assume there will be something here I should change, but if parts of
the patch are ready for pushing, let me know that.
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
1
1
Hi Philip, all
Philip's bug on recovery from a kill -9 during DML reminded me... from
my experience, aborting a DDL query (ALTER TABLE - things like adding
an index usually) can be tricky and more often than not causes hassles.
By abort in this case I mean killing the connection/thread that does
the DDL.
It would be great to somehow have a few test cases in the suite and it
needs to be tested with all storage engines.
Provided the table it operates on is big enough, it should be possible
to just zap it by timing.
Thoughts?
Cheers,
Arjen.
2
2
07 Jul '10
Danny,
this is the log from a short discussion about how to provide the
documentation for Sphinx to our users. Can you think of some
easy way to automate at least the notification process that
the docs changed?
<timour> knielsen, shodan IMHO it would be best if Sphinx could notify us automatically about changes in the docs, rather than having possibly outdated documentation, with a link to the up-to-date docs.
<timour> If we take this manual approach to copy/paste docs for a dozen storage engines, we will end up in a big mess.
<shodan> timour: would http://code.google.com/feeds/p/sphinxsearch/svnchanges/basic do for notifications?
<timour> shodan, isn't that a feed with all changes?
<shodan> timour: yes, and so doc/ changes are there too
<timour> knielsen, perhaps we could pull the docs automatically?
<shodan> timour: maybe we could setup a special hook on commits to just doc/ but def not to a specific section
<timour> shodan, I think we have to ask our docs guy what is the best way to at least be automatically notified of doc changes (for the same release).
<timour> To me there are 2 kinds of doc changes:
<timour> - improvements of the docs for the same version of the code,
<timour> - changes due to a new release.
<timour> The latter will be straightforward, because we will know when we release a new version of Sphinx, but not the former.
What do you think about the problem with outdated docs of storage engines?
Timour
2
1
Hi all,
The attached patch adds a -64 keyword to the script
make_mariadb_win_dist. With this patch, the script produces both 32 bit
zip files (without any arguments) and 64 bit zip files (with -64). I
also expanded it so it's possible to run it with "-nobuild -64" and a
help text.
Kristian, when this patch is pushed, you can add a new buildbot slave
based on the one that builds the 32 bit zip file and installer. Adding
the -64 argument on the call to this script is the only thing necessary
for buildbot to produce 64 bit binaries. The result files are named
-win64 instead of -win32.
Ok to push this to 5.1?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
2
4
06 Jul '10
The three companies Continuent, Codership, and Monty Program are planning to
start working on some enhancements to the replication system in MariaDB,
together with anyone interested in joining in.
At this stage, there are no fixed directions for the project, and to do this
in as open a way possible with the maximum community involvement and interest,
we agreed to start with an email discussion on the maria-developers@ mailing
list. So consider it started!
The plan so far is:
1) The parties taking this initiative, MP, Continuent, and Codership, present
their own ideas in this thread on maria-developers@ (and everyone else who
wants to chime in at this stage).
2) Once we have some concrete suggestions as a starting point, we use this to
reach out in a broader way with some blog posts on planetmysql / planetmariadb
to encourage further input and discussions for possible directions of the
project. Eventually we want to end up with a list of the most important goals
and a possible roadmap for replication enhancements.
(It is best to have something concrete as a basis of a broad community
discussion/process).
To start off, here are some points of interest that I collected. Everyone
please chime in with your own additional points, as well as comments and
further details on these.
Three areas in particular seem to be of high interest in the community
currently (not excluding any other areas):
- High Availability
* Most seems to focus on building HA solutions on top of MySQL
replication, eg. MMM and Tungsten.
* For this project, seems mainly to be to implement improvements to
replication that help facilitate improving these on-top HA solutions.
* Tools to automate (or help automate) failover from a master.
* Better facilities to do initial setup of new slave without downtime, or
re-sync of an old master or slave that has been outside of the
replication topology for some period of time.
- Performance, especially scalability
* Multi-threaded slave SQL thread.
* Store the binlog inside a transactional engine (eg. InnoDB) to reduce
I/O, solve problems like group commit, and simplify crash recovery.
- More pluggable replication
* Make the replication code and APIs be more suitable for people to build
extra functionality into or on top of the stock MySQL replication.
* Better documentation of C++ APIs and binlog format.
* Adding extra information to binlog that can be useful for non-standard
replication stuff. For example column names (for RBR), checksums.
* Refactoring the server code to be more modular with APIs more suitable
for external usage.
* Add support for replication plugins, types to be determined. For example
binlog filtering plugins?
It is also very important to consider the work that the replication team at
MySQL is doing (and has done). I found a good deal of interesting information
about this here:
http://forge.mysql.com/wiki/MySQL_Replication:_Walk-through_of_the_new_5.1_…)
This describes a number of 6.0/5.4 and preview features that we could merge
and/or contribute to. Here are the highlights that I found:
- Features included in 6.0/5.4 (which are cancelled I think, but presumably
this will go in a milestone release):
* CHANGE MASTER ... IGNORE_SERVER_IDS for better support of circular
replication.
* Replication heartbeat.
* sync_relay_log_info, sync_master_info, sync_relay_log, relay_log_recovery
for crash recovery on slave.
* Binlog Performance Optimization (lock contention improvement).
* Semi-synchronous Replication, with Pluggable Replication Architecture.
http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication
- Feature previews:
* Parallel slave application: WL#4648
http://forge.mysql.com/wiki/ReplicationFeatures/ParallelSlave
* Time-delayed replication: WL#344
http://forge.mysql.com/wiki/ReplicationFeatures/DelayedReplication
* Scriptable Replication: WL#4008
http://forge.mysql.com/wiki/ReplicationFeatures/ScriptableReplication
* Synchronous Replication.
Drizzle is also doing work on a new replication system. I read through the
series of blog posts that Jay Pipes wrote on this subject. They mostly deal
with how this is designed in terms of the Drizzle server code, and are low on
detail about how the replication will actually work (the only thing I could
really extract was that it is a form of row-based replication). If someone has
links to information about this that I missed, it could be interesting.
Let the discussion begin!
- Kristian.
13
67
[Maria-developers] MWL#123, sql layer part (was: Re: mwl#121: follow up)
by Sergey Petrunya 05 Jul '10
by Sergey Petrunya 05 Jul '10
05 Jul '10
Hello Igor,
On Thu, Jul 01, 2010 at 11:55:58PM -0700, Igor Babaev wrote:
> According to our agreement I introduced a new flag for MRR.
> I called it HA_MRR_MATERIALIZED_KEYS. This flag passed to any MMR
> interface function that takes mrr_mode as a parameter says:
> key values used in ranges are materialized in some buffers external to MRR.
>
This patch only gives DS-MRR information that the keys are materialized.
However, DS-MRR needs to work on (key, range_id) pairs. The patch doesn't
allow associating a key with its range_id (or vice versa), so DS-MRR will have to
keep (key_pointer, range_id) tuples.
If we consider a 64-bit environment, then sizeof(key_value_pointer) ==
sizeof(key_value), and this patch won't bring any benefit at all.
I was expecting that the patch would provide some means to associate
key_value_pointer with range_id, so that DS-MRR won't need to store both.
Can we discuss this at the scrum meeting or on Monday evening?
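
To make the size argument above concrete, here is a back-of-the-envelope sketch; the struct names and the 8-byte key length are hypothetical, chosen only to show that a bare key pointer saves nothing on a 64-bit build once range_id has to be kept next to it:

#include <cstdio>

// Hypothetical buffer elements, not the actual DS-MRR structures.

// What DS-MRR stores today: a copy of the key value plus the range handle.
struct KeyCopyAndRange
{
  unsigned char key[8];          // assume an 8-byte key for this example
  void *range_id;                // opaque per-range handle returned to the caller
};

// What it would store with HA_MRR_MATERIALIZED_KEYS but no key<->range_id
// association: a pointer into the caller's buffer plus the range handle.
struct KeyPtrAndRange
{
  const unsigned char *key_ptr;  // points at the materialized key
  void *range_id;
};

int main()
{
  // On a 64-bit build sizeof(void*) == 8, so both entries are the same size:
  // keeping a pointer instead of the 8-byte key value buys nothing unless the
  // range_id can be derived from the key pointer (or vice versa).
  std::printf("key copy entry: %zu bytes, key pointer entry: %zu bytes\n",
              sizeof(KeyCopyAndRange), sizeof(KeyPtrAndRange));
  return 0;
}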
> === modified file 'sql/handler.h'
> --- sql/handler.h 2010-03-20 12:01:47 +0000
> +++ sql/handler.h 2010-07-01 20:00:35 +0000
> @@ -1212,6 +1212,12 @@ void get_sweep_read_cost(TABLE *table, h
> */
> #define HA_MRR_NO_NULL_ENDPOINTS 128
>
> +/*
> + The MRR user has materialized range keys somewhere in the user's buffer.
> + This can be used for optimization of the procedure that sorts these keys
> + since in this case key values don't have to be copied into the MRR buffer.
> +*/
> +#define HA_MRR_MATERIALIZED_KEYS 256
>
>
> /*
>
> === modified file 'sql/sql_join_cache.cc'
> --- sql/sql_join_cache.cc 2010-03-07 15:41:45 +0000
> +++ sql/sql_join_cache.cc 2010-07-01 19:59:55 +0000
> @@ -651,6 +651,9 @@ int JOIN_CACHE_BKA::init()
>
> use_emb_key= check_emb_key_usage();
>
> + if (use_emb_key)
> + mrr_mode|= HA_MRR_MATERIALIZED_KEYS;
> +
> create_remaining_fields(FALSE);
>
> set_constants();
> @@ -2617,6 +2620,8 @@ int JOIN_CACHE_BKA_UNIQUE::init()
> data_fields_offset+= copy->length;
> }
>
> + mrr_mode|= HA_MRR_MATERIALIZED_KEYS;
> +
> DBUG_RETURN(rc);
> }
>
>
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
0
[Maria-developers] [5.3 merge] Item_in_subselect::init_left_expr_cache() question.
by Sergey Petrunya 03 Jul '10
by Sergey Petrunya 03 Jul '10
03 Jul '10
Hi Timour,
I'm having difficulties with finishing 5.2->5.3 merge. Could you please take a look at
https://bugs.launchpad.net/maria/+bug/598972 ? The questions are in the bug
entry.
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
1
[Maria-developers] DS-MRR: extra work filed as new WL entries, questions
by Sergey Petrunya 03 Jul '10
by Sergey Petrunya 03 Jul '10
03 Jul '10
Hello Igor,
Based on our discussions, I've filed
* http://askmonty.org/worklog/Server-RawIdeaBin/index.pl?tid=123
"DS-MRR for clustered PKs: more efficient buffer use"
* http://askmonty.org/worklog/Server-RawIdeaBin/index.pl?tid=124
"DS-MRR for clustered PKs: cost function"
* http://askmonty.org/worklog/Client-BackLog/index.pl?tid=125
"Make DS-MRR sort the ranges before scanning the index"
Could you please check if that correctly describes the conclusions we've
arrived at? Also, WL texts contain several unresolved questions, marked with
"TODO".
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
0
[Maria-developers] WL#122 New (by Sergei): fix locking in XA tc_log
by worklog-noreply@askmonty.org 30 Jun '10
by worklog-noreply@askmonty.org 30 Jun '10
30 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: fix locking in XA tc_log
CREATION DATE..: Wed, 30 Jun 2010, 17:26
SUPERVISOR.....: Sergei
IMPLEMENTOR....: Sergei
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 122 (http://askmonty.org/worklog/?tid=122)
VERSION........: Server-5.1
STATUS.........: Assigned
PRIORITY.......: 90
WORKED HOURS...: 0
ESTIMATE.......: 8 (hours remain)
ORIG. ESTIMATE.: 8
PROGRESS NOTES:
DESCRIPTION:
We need to analyze and fix the mutex lock order in TC_LOG code of the XA.
It's needed to fix the https://bugs.launchpad.net/maria/+bug/578117
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#99 Updated (by Sergei): dummy test task
by worklog-noreply@askmonty.org 30 Jun '10
by worklog-noreply@askmonty.org 30 Jun '10
30 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: dummy test task
CREATION DATE..: Thu, 04 Mar 2010, 07:18
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 99 (http://askmonty.org/worklog/?tid=99)
VERSION........: Server-9.x
STATUS.........: Cancelled
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Wed, 30 Jun 2010, 16:27)=-=-
Version updated.
--- /tmp/wklog.99.old.30336 2010-06-30 16:27:06.000000000 +0000
+++ /tmp/wklog.99.new.30336 2010-06-30 16:27:06.000000000 +0000
@@ -1 +1 @@
-Benchmarks-3.0
+Server-9.x
-=-=(Sergei - Wed, 30 Jun 2010, 16:27)=-=-
Status updated.
--- /tmp/wklog.99.old.30336 2010-06-30 16:27:06.000000000 +0000
+++ /tmp/wklog.99.new.30336 2010-06-30 16:27:06.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Cancelled
DESCRIPTION:
test missing estimate
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
30 Jun '10
Hi everyone,
This patch adds a page to the Windows installer that asks the user if he
wants to set up MariaDB as a Windows service.
The way to do this is to hack the NSIS template for CPack. IMHO, this is
a bad way, but it is the recommended (and only) way to do stuff like this.
The patch in the attached file install-service.patch is the only thing
necessary for the current sources. It's a one liner. So far so good :)
The actual patch adds a file win\cmake\NSIS.template.in which is a copy
of the NSIS.template.in from CMake. The attached patch
NSIS.template.in.patch shows the diff with the code I have added.
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
2
1
[Maria-developers] WL#85 Updated (by Sergei): Partitioned Key Cache for MyISAM
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Partitioned Key Cache for MyISAM
CREATION DATE..: Sun, 14 Feb 2010, 00:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Monty, Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 85 (http://askmonty.org/worklog/?tid=85)
VERSION........: Server-5.2
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:05)=-=-
Status updated.
--- /tmp/wklog.85.old.32131 2010-06-29 14:05:44.000000000 +0000
+++ /tmp/wklog.85.new.32131 2010-06-29 14:05:44.000000000 +0000
@@ -1 +1 @@
-Assigned
+Complete
-=-=(Igor - Tue, 16 Mar 2010, 19:34)=-=-
High Level Description modified.
--- /tmp/wklog.85.old.22371 2010-03-16 19:34:33.000000000 +0000
+++ /tmp/wklog.85.new.22371 2010-03-16 19:34:33.000000000 +0000
@@ -15,4 +15,5 @@
the chances for threads not compete for the same key cache lock better.
The idea and the original of the partitioned key cache was provided by one of
-our external contributers.
+our external contributers (see the attached file segmented_keycache_v2.diff with
+the original patch from the contributor).
-=-=(Igor - Sun, 14 Feb 2010, 00:15)=-=-
Category updated.
--- /tmp/wklog.85.old.9810 2010-02-13 22:15:43.000000000 +0000
+++ /tmp/wklog.85.new.9810 2010-02-13 22:15:43.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Server-Sprint
-=-=(Igor - Sun, 14 Feb 2010, 00:15)=-=-
Version updated.
--- /tmp/wklog.85.old.9810 2010-02-13 22:15:43.000000000 +0000
+++ /tmp/wklog.85.new.9810 2010-02-13 22:15:43.000000000 +0000
@@ -1 +1 @@
-Benchmarks-3.0
+Server-5.2
-=-=(Igor - Sun, 14 Feb 2010, 00:12)=-=-
New attachment: 'segmented_keycache_v2.diff'
DESCRIPTION:
A partitioned key cache is a collection of structures for regular MyISAM key
caches called key cache partitions. Any page from a file can be placed into a
buffer of only one partition. The number of the partition is calculated from the
file number and the position of the page in the file, and it is always the same
for the page. The function that maps pages into partitions takes care of even
distribution of pages among partitions.
A partitioned key cache mitigates one of the major problems of the simple key
cache: thread contention for the key cache lock (mutex). Every call of a key
cache interface function must acquire this lock. So threads compete for this
lock even when they have acquired shared locks for the file and the pages they
want to read from are already in the key cache buffers. When working with a
partitioned key cache, any key cache interface function that needs only one
page has to acquire the key cache lock only for the partition the page is
ascribed to. This improves the chances that threads do not compete for the
same key cache lock.
The idea and the original patch for the partitioned key cache were provided by
one of our external contributors (see the attached file
segmented_keycache_v2.diff with the original patch from the contributor).
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
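
The page-to-partition mapping described above can be illustrated with a small sketch; the hash below is a placeholder, not the formula used in the actual key cache code:

#include <cstdio>
#include <cstdint>

// Illustrative only: map (file number, page number) to one of N partitions so
// that a given page always lands in the same partition and pages spread evenly.
static unsigned page_partition(uint32_t file_no, uint64_t page_no,
                               unsigned n_partitions)
{
  // Any reasonable mixing function works; this one is just a placeholder.
  uint64_t h = page_no * 2654435761ULL ^ (uint64_t(file_no) << 32 | file_no);
  return (unsigned)(h % n_partitions);
}

int main()
{
  const unsigned parts = 4;
  unsigned counts[parts] = {0};

  // Two "files", 1000 pages each: check that the mapping is stable and
  // roughly even across the partitions.
  for (uint32_t file_no = 1; file_no <= 2; file_no++)
    for (uint64_t page_no = 0; page_no < 1000; page_no++)
      counts[page_partition(file_no, page_no, parts)]++;

  for (unsigned p = 0; p < parts; p++)
    std::printf("partition %u: %u pages\n", p, counts[p]);
  return 0;
}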
1
0
[Maria-developers] WL#112 Updated (by Sergei): Merge OQGraph into MariaDB
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Merge OQGraph into MariaDB
CREATION DATE..: Mon, 29 Mar 2010, 18:00
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 112 (http://askmonty.org/worklog/?tid=112)
VERSION........: Server-5.2
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 13
ESTIMATE.......: 2 (hours remain)
ORIG. ESTIMATE.: 15
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:04)=-=-
Status updated.
--- /tmp/wklog.112.old.32115 2010-06-29 14:04:40.000000000 +0000
+++ /tmp/wklog.112.new.32115 2010-06-29 14:04:40.000000000 +0000
@@ -1 +1 @@
-Code-Review
+Complete
-=-=(Knielsen - Tue, 06 Apr 2010, 15:28)=-=-
Fixed all issues from first code review.
Implement packaging for OQGraph in bakery.
Set up buildbot hosts for including OQGraph, including binary packaging.
-=-=(Knielsen - Wed, 31 Mar 2010, 13:38)=-=-
Status updated.
--- /tmp/wklog.112.old.12166 2010-03-31 13:38:25.000000000 +0000
+++ /tmp/wklog.112.new.12166 2010-03-31 13:38:25.000000000 +0000
@@ -1 +1 @@
-Assigned
+Code-Review
-=-=(Knielsen - Wed, 31 Mar 2010, 13:38)=-=-
High-Level Specification modified.
--- /tmp/wklog.112.old.12070 2010-03-31 13:38:08.000000000 +0000
+++ /tmp/wklog.112.new.12070 2010-03-31 13:38:08.000000000 +0000
@@ -15,3 +15,5 @@
Fix OQGraph plug.in to detect boost version >= 1.40.0, and only enable OQGraph
if such boost is found.
+Update the packaging in ourdelta/bakery to include the oqgraph_engine.so and
+link with g++ rather than gcc.
-=-=(Knielsen - Mon, 29 Mar 2010, 21:46)=-=-
High-Level Specification modified.
--- /tmp/wklog.112.old.31142 2010-03-29 21:46:11.000000000 +0000
+++ /tmp/wklog.112.new.31142 2010-03-29 21:46:11.000000000 +0000
@@ -1,28 +1,17 @@
Tasks:
-Find the latest version of OQGraph to base this on (there should be a
-Launchpad branch somewhere, match it up with what is in the OQGraph patch for
-MySQL 5.0 in the ourdelta stuff).
-
-Extract the correct version of Boost from the MySQL 5.0 ourdelta patch. This
-is a patched version of Boost fixing a bug that is supposedly fatal for
-OQGraph (details are not known at the time of writing).
+Base work on the Launchpad branch lp:~knielsen/maria/mariadb-5.1-oqgraph
-Document in OQGraph README the need for boost of a specific version, and point
-to where it can be obtained. Also include the patch for boost if the correct
-base version of boost to do this against can be determined.
+OQGraph requires Boost >= 1.40.0 (earlier versions have a bug that affects
+OQGraph).
-Install the patched boost in /usr/local/ on the build machines (release builds
-and selected Buildbot slaves).
+Document in OQGraph README the need for boost of a specific version, and point
+to where it can be obtained.
-Fix OQGraph plug.in to detect correct version of OQGraph that makes the build
-not break. Check which version in Ubuntu starts working (I think it was
-Jaunty), and require at least that version.
-
-Setup some repository or source tarball of the patched boost
-somewhere. Preferably a Launchpad branch or similar (if upstream project can
-be found).
+Install the patched boost in /usr/local/include/boost on the build machines
+(release builds and selected Buildbot slaves). G++ seems to by default look in
+/usr/local/include, so that is sufficient to find it.
-Setup in plug.in or /configure.in appropriate --with-boost=xxx. Or in a pinch,
-we can make do with CFLAGS=-Ixxx, or even default look in /usr/local/.
+Fix OQGraph plug.in to detect boost version >= 1.40.0, and only enable OQGraph
+if such boost is found.
-=-=(Knielsen - Mon, 29 Mar 2010, 18:09)=-=-
High-Level Specification modified.
--- /tmp/wklog.112.old.23061 2010-03-29 18:09:27.000000000 +0000
+++ /tmp/wklog.112.new.23061 2010-03-29 18:09:27.000000000 +0000
@@ -1 +1,28 @@
+Tasks:
+
+Find the latest version of OQGraph to base this on (there should be a
+Launchpad branch somewhere, match it up with what is in the OQGraph patch for
+MySQL 5.0 in the ourdelta stuff).
+
+Extract the correct version of Boost from the MySQL 5.0 ourdelta patch. This
+is a patched version of Boost fixing a bug that is supposedly fatal for
+OQGraph (details are not known at the time of writing).
+
+Document in OQGraph README the need for boost of a specific version, and point
+to where it can be obtained. Also include the patch for boost if the correct
+base version of boost to do this against can be determined.
+
+Install the patched boost in /usr/local/ on the build machines (release builds
+and selected Buildbot slaves).
+
+Fix OQGraph plug.in to detect correct version of OQGraph that makes the build
+not break. Check which version in Ubuntu starts working (I think it was
+Jaunty), and require at least that version.
+
+Setup some repository or source tarball of the patched boost
+somewhere. Preferably a Launchpad branch or similar (if upstream project can
+be found).
+
+Setup in plug.in or /configure.in appropriate --with-boost=xxx. Or in a pinch,
+we can make do with CFLAGS=-Ixxx, or even default look in /usr/local/.
DESCRIPTION:
Get the OQGraph storage engine merged into MariaDB, fixing the remaining
problems blocking the merge.
HIGH-LEVEL SPECIFICATION:
Tasks:
Base work on the Launchpad branch lp:~knielsen/maria/mariadb-5.1-oqgraph
OQGraph requires Boost >= 1.40.0 (earlier versions have a bug that affects
OQGraph).
Document in OQGraph README the need for boost of a specific version, and point
to where it can be obtained.
Install the patched boost in /usr/local/include/boost on the build machines
(release builds and selected Buildbot slaves). G++ seems to by default look in
/usr/local/include, so that is sufficient to find it.
Fix OQGraph plug.in to detect boost version >= 1.40.0, and only enable OQGraph
if such boost is found.
Update the packaging in ourdelta/bakery to include the oqgraph_engine.so and
link with g++ rather than gcc.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Connect by
CREATION DATE..: Thu, 26 Mar 2009, 00:30
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-BackLog
TASK ID........: 11 (http://askmonty.org/worklog/?tid=11)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 220 (hours remain)
ORIG. ESTIMATE.: 220
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.11.old.32063 2010-06-29 14:03:35.000000000 +0000
+++ /tmp/wklog.11.new.32063 2010-06-29 14:03:35.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-BackLog
-=-=(Guest - Tue, 19 May 2009, 18:27)=-=-
High Level Description modified.
--- /tmp/wklog.11.old.21953 2009-05-19 18:27:14.000000000 +0300
+++ /tmp/wklog.11.new.21953 2009-05-19 18:27:14.000000000 +0300
@@ -1 +1,360 @@
-Add CONNECT BY syntax
+<contents>
+1. Background information
+2. CONNECT BY semantics, properties and limitations
+2.1 Additional CONNECT BY features
+2.2 Limitations
+3. Our implementation
+3.1 Scope Questions
+3.2 CONNECT BY execution
+3.2.1 Straightforward (recursive) evaluation algorithm
+3.2.2 Transitive-closure evaluation algorithms
+3.2.3 Other algorithms
+3.2.4 Loop detection
+3.2.4.1 The upper bound of produced records
+3.2.4.1 Straightforward approach: track chains
+3.2.3 Improvements for straightforward execution strategy
+3.3. Optimization
+4. Use-cases dump
+</contents>
+
+1. Background information
+-------------------------
+* CONNECT BY is a non-standard, Oracle's syntax. It is also supported by
+ EnterpriseDB (Q: any other implementations?)
+
+* PostgreSQL 8.4 (now beta) has support for SQL-standard compliant WITH
+ RECURSIVE (aka Common Table Expressions, CTE) query syntax:
+ http://www.postgresql.org/docs/8.4/static/queries-with.html
+ http://www.postgresql.org/about/news.1074
+ http://archives.postgresql.org/pgsql-hackers/2008-02/msg00642.php
+ http://archives.postgresql.org/pgsql-patches/2008-05/msg00362.php
+
+* Evgen's attempt:
+ http://lists.mysql.com/internals/15569
+
+DB2 and MS SQL support SQL standard's WITH RECURSIVE clause.
+
+2. CONNECT BY semantics, properties and limitations
+---------------------------------------------------
+From Oracle's manual:
+
+<almost-quote>
+
+ SELECT ...
+ FROM ...
+ WHERE ...
+ START WITH cond
+ CONNECT BY connect_cond
+ ORDER [SIBLINGS] BY
+
+In oracle, one expression in connect_cond must be
+
+ PRIOR expr = expr
+
+ or
+
+ expr = PRIOR expr
+
+The manner in which Oracle processes a WHERE clause (if any) in a hierarchical
+query depends on whether the WHERE clause contains a join:
+
+ * If the WHERE predicate contains a join, Oracle applies the join predicates
+ before doing the CONNECT BY processing.
+ * If the WHERE clause does not contain a join, Oracle applies all predicates
+ other than the CONNECT BY predicates after doing the CONNECT BY processing
+ without affecting the other rows of the hierarchy.
+</almost-quote>
+
+See http://www.adp-gmbh.ch/ora/sql/connect_by.html
+http://download-uk.oracle.com/docs/cd/B10501_01/server.920/a96540/queries4a.htm
+
+
+2.1 Additional CONNECT BY features
+----------------------------------
+
+LEVEL pseudocolumn
+ indicates ancestry depth of the record (inital row has level=1, its children
+ have level=2 and so forth). Can be used in CONNECT BY clause to limit
+ traversal depth.
+
+SYS_CONNECT_BY_PATH(column, 'char')
+ returns path from root to the node.
+
+NOCYCLE and CONNECT_BY_ISCYCLE
+ "With the 10g keyword NOCYCLE, hierarchical queries detect loops and do not
+ generate errors. CONNECT_BY_ISCYCLE pseudo-column is a flag that can be used
+ to detect which row is cycling"
+ http://www.dba-oracle.com/t_advanced_sql_connect_by_loop.htm
+
+ORDER SIBLINGS BY
+ CONNECT BY produces records in "children follow parents" order, with order
+ of the siblings unspecified. ORDER SIBLINGS BY orders siblings within each
+ "generation".
+
+2.2 Limitations
+---------------
+Other limitations (which we might or might not want to replicate)
+
+* There is this error:
+ ORA-01437: cannot have join with CONNECT BY
+ Cause: A join operation was specified with a CONNECT BY clause. If a
+ CONNECT BY clause is used in a SELECT statement for a tree-
+ structured query, only one table may be referenced in the query.
+ Action: Remove either the CONNECT BY clause or the join operation from
+ the SQL statement.
+ It seems oracle had this limitation before version 10G
+
+* LEVEL cannot be used on the left side of IN-comparison if the right side is a
+ subquery
+http://download.oracle.com/docs/cd/B10501_01/server.920/a96540/sql_elements6a.htm#9547
+ This seems to have been lifted in version 10?
+
+3. Our implementation
+---------------------
+
+3.1 Scope Questions
+-------------------
+* Are we sure we want CONNECT BY syntax and not SQL standard' one? (I'm not
+ suggesting one or the other, just want to make sure we've made a conscious
+ decision)
+
+* Any use-cases we need to make sure to handle well?
+
+Will we implement any of these features:
+
+* Output is ordered (children follow parents)
+* "ORDER SIBLINGS BY" variant of ORDER BY
+* NOCYCLE/CONNECT_BY_ISCYCLE
+ - It seems any checking for cycles will cause overhead. Do we implement a
+ mode for those who know what they are doing, where the server doesn't
+ actually check cycles but only reports error if it happened to enumerate,
+ say MAX(1M, #records_in_table * 10) records? (This doesn't guarantee that
+ there are no cycles, but this is just beyond what one could logically want)
+
+* Oracle's treatment of WHERE (if there's a join - the WHERE is applied after
+ connect by, otherwise before) [Yes]
+* Can one use SYS_CONNECT_BY_PATH in the CONNECT BY expression?
+
+
+3.2 CONNECT BY execution
+------------------------
+
+3.2.1 Straightforward (recursive) evaluation algorithm
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+As specified in CONNECT BY definition, breadth-first, parent-to-children
+traversal:
+
+ start with a scan that retrieves records using the START WITH condition;
+ pass rows to ouptut and also record them (i.e. needed columns) in
+ some sort of growable, overflow-to-disk buffer in_buf;
+
+ while(in_buf is not empty)
+ {
+ for each record in the buffer
+ {
+ do a scan based on CONNECT BY condition;
+ pass rows to output and also record them (i.e. needed columns) in
+ a growable, overflow-to-disk buffer out_buf;
+ }
+ in_buf= out_buf;
+ }
+
+This algorithm will produce rows in the required order.
+
+3.2.2 Transitive-closure evaluation algorithms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When CONNECT BY clause refers only to current and PRIOR records (and doesn't
+refer to connect path using LEVEL or SYS_CONNECT_BY_PATH functions), then
+evaluation of CONNECT BY operation is equivalent to building a transitive
+closure of a certain relation.
+
+TODO: can we use LEVEL/SYS_CONNECT_BY_PATH in select list with these
+ algorithms? looks like no?
+
+There are special algorithms to build transitive closure of relation that is
+represented as a table of edges, e.g. Blocked Warshall Algorithm.
+
+Q: Do we investigate further in this direction?
+
+3.2.3 Other algorithms
+----------------------
+To be resolved: Do we always start from the first clause and go to children?
+Does it make sense to proceed in other direction, from children to parents?
+Looks like no? TODO need definite answer.
+
+3.2.4 Loop detection
+~~~~~~~~~~~~~~~~~~~~
+Transitive-closure algorithms can detect loops (it seems some of them can also
+handle loop avoidance but that needs to be verified).
+
+Straightforward-evaluation algorithm will work forever if there is a loop,
+hence will need assistance in loop detection/avoidance.
+
+3.2.4.1 The upper bound of produced records
+-------------------------------------------
+There is an upper bound of the amount of records CONNECT BY runtime can
+generate without generating a loop.
+
+The worst case is when
+ * every record in a source table was in the parent generation (and thus has
+ started a parent->child->child->... chain)
+ * every chain is of #table-records length.
+
+example of such case:
+
+ SELECT * FROM employees
+ START WITH true
+ CONNECT BY
+ PRIOR emp_id = (emp_id + 1) MOD $n_employees AND
+ length(SYS_CONNECT_BY_PATH('-')) = $n_employees -- guard againist
+ -- forming loops
+
+this gives that we can at most generate O(#table_records^2) records. This
+limitation can be used as a primitive way to stop evaluation.
+
+
+3.2.4.1 Straightforward approach: track chains
+----------------------------------------------
+In general case, we will have to track which records we have seen across each
+of the parent-child chains. The same record can show up in different chains
+at different times and this won't form a loop:
+
+ parent generation1 generation2
+
+ row1- --+---row2---- ---row3-- (chain1)
+ |
+ \--row3-+-- ---row2-- (chain2)
+ |
+ \- ---row4-- (chain3)
+ row4- ...
+
+Tracking can be done by
+- Numbering the chains and using one structure (e.g temptable) to store
+ (rowid, chain#) pairs and check them for uniqueness.
+
+- Using per-chain data structure which we could serialize/deserialize. This
+ could be
+ - serializable hashtable
+ - ordered rowid list
+ - serializable sparse bitmap
+
+One can expect a lot of chains to have common starts (eg. look at chain2 and
+chain3). I don't see how one could take advantage of that, though.
+
+3.2.3 Improvements for straightforward execution strategy
+---------------------------------------------------------
+
+* If the query is a join, it may make sense to materialize it join result
+ (including creation of appropriate index) so we're able to make
+ parent-to-child transitions faster.
+ This seems to be connected to Evgen's work on FROM subqueries.
+
+* If there is a suitable index, we can employ a variant of BatchedKeyAccess.
+
+* Part of CONNECT BY expression that places restrictions on subsequent
+ generation can be moved to the WHERE. If we do that, we get two recordsets:
+
+1. Initial START WITH recordset
+
+2. A recordset to be used to advance to subsequent generation
+
+
+3.3. Optimization
+-----------------
+It seems it is nearly impossible to estimate how many iterations we'll have
+to make and how many records we will end up producing.
+
+TODO: some bad estimates.. assume a fixed number of generations, reuse ref
+accces estimations for fanount, which gives
+
+ access_method_estimate ^ number_of_generations
+
+estimate?
+
+4. Use-cases dump
+=================
+
+http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/05/0264.htm:
+ select mak_xx,nr_porz,level lvl from spacer_strona
+ where nvl(dervlvl,0)<3
+ start with mak_xx=125414 and nr_porz=0
+ connect by mak_xx = prior derv_mak_xx and nr_porz = prior derv_nr_porz
+ and prior dervlvl=3
+
+
+http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/04/0196.htm:
+ SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER START WITH OPM_N_ID IN
+ (
+ SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER x
+ START WITH x.OPM_N_ID IN (4846)
+ CONNECT BY ((PRIOR x.OPM_MERE_OPM_N_ID = x.OPM_N_ID)
+ OR (PRIOR x.OPM_ANNULEE_OPM_N_ID = x.OPM_N_ID))
+ )
+ CONNECT BY ((PRIOR OPM_N_ID = OPM_MERE_OPM_N_ID) OR (PRIOR OPM_N_ID =
+OPM_ANNULEE_OPM_N_ID))
+
+http://forums.enterprisedb.com/posts/list/737.page:
+ select lpad(' ',2*(level-1)) || to_char(child) s
+ from x
+ start with parent is null
+ connect by prior child = parent;
+
+ select *
+ from emp, dept
+ where dept.deptno = emp.deptno
+ start with mgr is null
+ connect by mgr = prior empno
+
+http://forums.oracle.com/forums/thread.jspa?threadID=623173:
+ SELECT cust_number
+ FROM customer
+ START WITH cust_number = '5568677999'
+ CONNECT BY PRIOR cust_number = cust_group_code.
+
+http://www.orafaq.com/forum/t/118879/0/
+ SELECT COUNT(a.dataid), c.name
+ FROM dauditnew a, dtree b, kuaf c
+ WHERE a.auditdate > SYSDATE-10 AND a.auditstr IN ('Create', 'AddVersion')
+ AND a.dataid = b.dataid AND c.id = a.performerid
+ AND a.SUBTYPE = 0
+ START WITH b.dataid = 6132086 CONNECT BY PRIOR a.dataid = b.parentid GROUP BY
+c.name
+
+
+http://www.postgresql-support.de/blog/blog_hans.html
+ SELECT METIER_ID||'|'||ORGANISATION_ID AS JOBORG
+ FROM INTRA_METIER,INTRA_ORGANISATION
+ WHERE METIER_ID IN(
+ SELECT METIER_ID
+ FROM INTRA_METIER
+ START WITH METIER_ID= '99533220-e8b2-4121-998c-808ea8ca2da7'
+ CONNECT BY METIER_ID= PRIOR PARENT_METIER_ID
+ ) AND ORGANISATION_ID IN (
+ SELECT ORGANISATION_ID
+ FROM INTRA_ORGANISATION
+ START WITH ORGANISATION_ID='025ee58f-35a3-4183-8679-01472838f753'
+ CONNECT BY ORGANISATION_ID= PRIOR PARENT_ORGANISATION_ID
+ );
+
+http://oracle.com
+ Oracle database uses CONNECT BY to generate EXPLAINs.
+
+http://practical-sql-tuning.blogspot.com/2009/01/use-of-statistically-incorrect.html
+
+ select sum(human_cnt) from facts
+ where territory_id in (select territory_id
+ from dic$territory
+ start with territory_code = :code
+ connect by prior territory_id = territory_parent);
+
+http://www.dbasupport.com/forums/archive/index.php/t-30008.html
+
+
+ SELECT LEVEL,LPAD(' ',8*(LEVEL-1))||T_COM_OBJ.OBJ_NAME, T_COM_OBJ.OBJ_PARENT,
+T_COM_OBJ.OBJ_ID
+ FROM VDR.T_COM_OBJ
+ START WITH T_COM_OBJ.OBJ_ID in (select obj_id obj_main from vdr.t_com_obj
+where obj_id=obj_parent)
+ CONNECT BY PRIOR T_COM_OBJ.OBJ_ID = T_COM_OBJ.OBJ_PARENT
+
+
DESCRIPTION:
<contents>
1. Background information
2. CONNECT BY semantics, properties and limitations
2.1 Additional CONNECT BY features
2.2 Limitations
3. Our implementation
3.1 Scope Questions
3.2 CONNECT BY execution
3.2.1 Straightforward (recursive) evaluation algorithm
3.2.2 Transitive-closure evaluation algorithms
3.2.3 Other algorithms
3.2.4 Loop detection
3.2.4.1 The upper bound of produced records
3.2.4.2 Straightforward approach: track chains
3.2.5 Improvements for straightforward execution strategy
3.3. Optimization
4. Use-cases dump
</contents>
1. Background information
-------------------------
* CONNECT BY is non-standard, Oracle-specific syntax. It is also supported by
EnterpriseDB (Q: any other implementations?)
* PostgreSQL 8.4 (now beta) has support for SQL-standard compliant WITH
RECURSIVE (aka Common Table Expressions, CTE) query syntax:
http://www.postgresql.org/docs/8.4/static/queries-with.html
http://www.postgresql.org/about/news.1074
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00642.php
http://archives.postgresql.org/pgsql-patches/2008-05/msg00362.php
* Evgen's attempt:
http://lists.mysql.com/internals/15569
DB2 and MS SQL support SQL standard's WITH RECURSIVE clause.
2. CONNECT BY semantics, properties and limitations
---------------------------------------------------
From Oracle's manual:
<almost-quote>
SELECT ...
FROM ...
WHERE ...
START WITH cond
CONNECT BY connect_cond
ORDER [SIBLINGS] BY
In Oracle, one expression in connect_cond must be
PRIOR expr = expr
or
expr = PRIOR expr
The manner in which Oracle processes a WHERE clause (if any) in a hierarchical
query depends on whether the WHERE clause contains a join:
* If the WHERE predicate contains a join, Oracle applies the join predicates
before doing the CONNECT BY processing.
* If the WHERE clause does not contain a join, Oracle applies all predicates
other than the CONNECT BY predicates after doing the CONNECT BY processing
without affecting the other rows of the hierarchy.
</almost-quote>
See http://www.adp-gmbh.ch/ora/sql/connect_by.html
http://download-uk.oracle.com/docs/cd/B10501_01/server.920/a96540/queries4a…
2.1 Additional CONNECT BY features
----------------------------------
LEVEL pseudocolumn
indicates ancestry depth of the record (initial row has level=1, its children
have level=2 and so forth). Can be used in CONNECT BY clause to limit
traversal depth.
SYS_CONNECT_BY_PATH(column, 'char')
returns path from root to the node.
NOCYCLE and CONNECT_BY_ISCYCLE
"With the 10g keyword NOCYCLE, hierarchical queries detect loops and do not
generate errors. CONNECT_BY_ISCYCLE pseudo-column is a flag that can be used
to detect which row is cycling"
http://www.dba-oracle.com/t_advanced_sql_connect_by_loop.htm
ORDER SIBLINGS BY
CONNECT BY produces records in "children follow parents" order, with order
of the siblings unspecified. ORDER SIBLINGS BY orders siblings within each
"generation".
2.2 Limitations
---------------
Other limitations (which we might or might not want to replicate)
* There is this error:
ORA-01437: cannot have join with CONNECT BY
Cause: A join operation was specified with a CONNECT BY clause. If a
CONNECT BY clause is used in a SELECT statement for a tree-
structured query, only one table may be referenced in the query.
Action: Remove either the CONNECT BY clause or the join operation from
the SQL statement.
It seems oracle had this limitation before version 10G
* LEVEL cannot be used on the left side of IN-comparison if the right side is a
subquery
http://download.oracle.com/docs/cd/B10501_01/server.920/a96540/sql_elements…
This seems to have been lifted in version 10?
3. Our implementation
---------------------
3.1 Scope Questions
-------------------
* Are we sure we want CONNECT BY syntax and not the SQL standard one? (I'm not
suggesting one or the other, just want to make sure we've made a conscious
decision)
* Any use-cases we need to make sure to handle well?
Will we implement any of these features:
* Output is ordered (children follow parents)
* "ORDER SIBLINGS BY" variant of ORDER BY
* NOCYCLE/CONNECT_BY_ISCYCLE
- It seems any checking for cycles will cause overhead. Do we implement a
mode for those who know what they are doing, where the server doesn't
actually check cycles but only reports error if it happened to enumerate,
say MAX(1M, #records_in_table * 10) records? (This doesn't guarantee that
there are no cycles, but this is just beyond what one could logically want)
* Oracle's treatment of WHERE (join predicates are applied before the CONNECT
  BY processing, the remaining predicates after it; see the quote in section 2) [Yes]
* Can one use SYS_CONNECT_BY_PATH in the CONNECT BY expression?
3.2 CONNECT BY execution
------------------------
3.2.1 Straightforward (recursive) evaluation algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As specified in CONNECT BY definition, breadth-first, parent-to-children
traversal:
  start with a scan that retrieves records using the START WITH condition;
  pass rows to output and also record them (i.e. needed columns) in
  some sort of growable, overflow-to-disk buffer in_buf;

  while (in_buf is not empty)
  {
    for each record in the buffer
    {
      do a scan based on CONNECT BY condition;
      pass rows to output and also record them (i.e. needed columns) in
      a growable, overflow-to-disk buffer out_buf;
    }
    in_buf= out_buf;
  }
This algorithm will produce rows in the required order.
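For illustration, a toy in-memory C++ version of the in_buf/out_buf loop above
(Row, table and the id/parent_id columns are invented for this sketch; the real
engine would drive handler scans and spill the buffers to disk):

    #include <cstdio>
    #include <vector>

    struct Row { int id; int parent_id; };

    int main()
    {
      // parent_id == -1 plays the role of "parent IS NULL" (two root rows here).
      std::vector<Row> table= { {1,-1}, {2,1}, {3,1}, {4,2}, {5,-1}, {6,5} };

      // START WITH: the initial generation goes to output and into in_buf.
      std::vector<Row> in_buf;
      for (const Row &r : table)
        if (r.parent_id == -1)
        {
          std::printf("level 1: id=%d\n", r.id);
          in_buf.push_back(r);
        }

      // CONNECT BY PRIOR id = parent_id: expand generation by generation.
      for (int level= 2; !in_buf.empty(); level++)
      {
        std::vector<Row> out_buf;
        for (const Row &prior : in_buf)
          for (const Row &r : table)
            if (r.parent_id == prior.id)
            {
              std::printf("level %d: id=%d\n", level, r.id);
              out_buf.push_back(r);
            }
        in_buf= out_buf;   // the new generation becomes the current one
      }
      return 0;
    }

Each pass over in_buf corresponds to one generation; the loop terminates when a
generation produces no children (or, with loop detection added, when a loop is
reported).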
3.2.2 Transitive-closure evaluation algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When CONNECT BY clause refers only to current and PRIOR records (and doesn't
refer to connect path using LEVEL or SYS_CONNECT_BY_PATH functions), then
evaluation of CONNECT BY operation is equivalent to building a transitive
closure of a certain relation.
TODO: can we use LEVEL/SYS_CONNECT_BY_PATH in select list with these
algorithms? looks like no?
There are special algorithms to build the transitive closure of a relation that
is represented as a table of edges, e.g. the Blocked Warshall Algorithm.
Q: Do we investigate further in this direction?
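For reference, a plain (non-blocked) Warshall closure over an adjacency matrix
looks like the C++ sketch below; the edges are made up, and the blocked variant
mentioned above computes the same closure but iterates over matrix tiles for
cache locality:

    #include <cstdio>
    #include <vector>

    int main()
    {
      const int n= 4;
      // reach[i][j]: edge (and, after the loop, any path) from i to j.
      std::vector<std::vector<bool> > reach(n, std::vector<bool>(n, false));
      reach[0][1]= true;  reach[1][2]= true;  reach[2][3]= true;

      for (int k= 0; k < n; k++)
        for (int i= 0; i < n; i++)
          for (int j= 0; j < n; j++)
            if (reach[i][k] && reach[k][j])
              reach[i][j]= true;

      for (int i= 0; i < n; i++)
        for (int j= 0; j < n; j++)
          if (reach[i][j])
            std::printf("%d reaches %d\n", i, j);
      return 0;
    }

Applying this to CONNECT BY would mean materializing the parent/child relation
as edges first, which is why it only fits the case where the connect condition
refers to nothing but current and PRIOR columns.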
3.2.3 Other algorithms
----------------------
To be resolved: Do we always start from the first clause and go to children?
Does it make sense to proceed in the other direction, from children to parents?
Looks like no? TODO: need a definite answer.
3.2.4 Loop detection
~~~~~~~~~~~~~~~~~~~~
Transitive-closure algorithms can detect loops (it seems some of them can also
handle loop avoidance but that needs to be verified).
Straightforward-evaluation algorithm will work forever if there is a loop,
hence will need assistance in loop detection/avoidance.
3.2.4.1 The upper bound of produced records
-------------------------------------------
There is an upper bound on the number of records the CONNECT BY runtime can
generate without running into a loop.
The worst case is when
* every record in a source table was in the parent generation (and thus has
started a parent->child->child->... chain)
* every chain is of #table-records length.
An example of such a case:

  SELECT * FROM employees
  START WITH true
  CONNECT BY
    PRIOR emp_id = (emp_id + 1) MOD $n_employees AND
    length(SYS_CONNECT_BY_PATH('-')) = $n_employees -- guard against
                                                    -- forming loops

This means we can generate at most O(#table_records^2) records (for a 1,000-row
table that is on the order of a million rows). This limitation can be used as a
primitive way to stop evaluation.
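A trivial sketch of that stop rule (the constants are made up; the real check
would live in the CONNECT BY runtime and raise an error rather than print):

    #include <cstdio>

    /*
      Primitive loop guard: once more rows have been produced than the
      loop-free upper bound (#table_records squared), assume the CONNECT BY
      condition forms a cycle.
    */
    static bool over_upper_bound(unsigned long produced, unsigned long table_records)
    {
      return produced > table_records * table_records;
    }

    int main()
    {
      const unsigned long table_records= 1000;
      unsigned long produced= 0;

      while (!over_upper_bound(produced, table_records))
        produced+= 1000;              // stands in for emitting one generation

      std::printf("gave up after %lu rows: possible CONNECT BY loop\n", produced);
      return 0;
    }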
3.2.4.2 Straightforward approach: track chains
----------------------------------------------
In the general case, we will have to track which records we have seen across
each of the parent-child chains. The same record can show up in different
chains at different times, and this won't form a loop:
  parent  generation1       generation2

  row1- --+---row2----      ---row3--   (chain1)
          |
          \--row3-+--       ---row2--   (chain2)
                  |
                  \-        ---row4--   (chain3)
  row4- ...
Tracking can be done by
- Numbering the chains and using one structure (e.g. a temptable) to store
  (rowid, chain#) pairs and check them for uniqueness (see the sketch below).
- Using a per-chain data structure which we could serialize/deserialize. This
  could be
  - a serializable hashtable
  - an ordered rowid list
  - a serializable sparse bitmap

One can expect a lot of chains to have common starts (e.g. look at chain2 and
chain3). I don't see how one could take advantage of that, though.
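A minimal C++ sketch of the first option, with std::set standing in for a
temptable with a unique (rowid, chain#) key; rowids and chain numbers are
invented for the example:

    #include <cstdio>
    #include <set>
    #include <utility>

    typedef std::pair<long, int> RowInChain;   // (rowid, chain#)

    // Returns false if this rowid was already seen in this chain, i.e. the
    // chain is about to loop.
    static bool check_and_remember(std::set<RowInChain> &seen, long rowid, int chain)
    {
      return seen.insert(std::make_pair(rowid, chain)).second;
    }

    int main()
    {
      std::set<RowInChain> seen;
      std::printf("%d\n", check_and_remember(seen, 42, 1));  // 1: first time
      std::printf("%d\n", check_and_remember(seen, 42, 2));  // 1: other chain, ok
      std::printf("%d\n", check_and_remember(seen, 42, 1));  // 0: loop in chain 1
      return 0;
    }

Insertion failing for a pair means the same row was already seen in that
particular chain; seeing the same row in two different chains is still allowed.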
3.2.5 Improvements for straightforward execution strategy
---------------------------------------------------------
* If the query is a join, it may make sense to materialize the join result
  (including creation of an appropriate index) so we're able to make
  parent-to-child transitions faster.
  This seems to be connected to Evgen's work on FROM subqueries.
* If there is a suitable index, we can employ a variant of BatchedKeyAccess.
* The part of the CONNECT BY expression that places restrictions on the
  subsequent generation can be moved to the WHERE. If we do that, we get two
  recordsets:
  1. The initial START WITH recordset
  2. A recordset to be used to advance to the subsequent generation
3.3. Optimization
-----------------
It seems it is nearly impossible to estimate how many iterations we'll have
to make and how many records we will end up producing.
TODO: some rough estimates: assume a fixed number of generations and reuse the
ref access estimates for the fanout, which gives an estimate of

  access_method_estimate ^ number_of_generations

(e.g. an estimated fanout of 10 over an assumed 3 generations gives roughly
1000 rows)?
4. Use-cases dump
=================
http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/05/0264.h…:
select mak_xx,nr_porz,level lvl from spacer_strona
where nvl(dervlvl,0)<3
start with mak_xx=125414 and nr_porz=0
connect by mak_xx = prior derv_mak_xx and nr_porz = prior derv_nr_porz
and prior dervlvl=3
http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/01/04/0196.h…:
SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER START WITH OPM_N_ID IN
(
SELECT OPM_N_ID FROM RAFC_ADM.AFC_T_OPERATION_METIER x
START WITH x.OPM_N_ID IN (4846)
CONNECT BY ((PRIOR x.OPM_MERE_OPM_N_ID = x.OPM_N_ID)
OR (PRIOR x.OPM_ANNULEE_OPM_N_ID = x.OPM_N_ID))
)
CONNECT BY ((PRIOR OPM_N_ID = OPM_MERE_OPM_N_ID) OR (PRIOR OPM_N_ID =
OPM_ANNULEE_OPM_N_ID))
http://forums.enterprisedb.com/posts/list/737.page:
select lpad(' ',2*(level-1)) || to_char(child) s
from x
start with parent is null
connect by prior child = parent;
select *
from emp, dept
where dept.deptno = emp.deptno
start with mgr is null
connect by mgr = prior empno
http://forums.oracle.com/forums/thread.jspa?threadID=623173:
SELECT cust_number
FROM customer
START WITH cust_number = '5568677999'
CONNECT BY PRIOR cust_number = cust_group_code.
http://www.orafaq.com/forum/t/118879/0/
SELECT COUNT(a.dataid), c.name
FROM dauditnew a, dtree b, kuaf c
WHERE a.auditdate > SYSDATE-10 AND a.auditstr IN ('Create', 'AddVersion')
AND a.dataid = b.dataid AND c.id = a.performerid
AND a.SUBTYPE = 0
START WITH b.dataid = 6132086 CONNECT BY PRIOR a.dataid = b.parentid GROUP BY
c.name
http://www.postgresql-support.de/blog/blog_hans.html
SELECT METIER_ID||'|'||ORGANISATION_ID AS JOBORG
FROM INTRA_METIER,INTRA_ORGANISATION
WHERE METIER_ID IN(
SELECT METIER_ID
FROM INTRA_METIER
START WITH METIER_ID= '99533220-e8b2-4121-998c-808ea8ca2da7'
CONNECT BY METIER_ID= PRIOR PARENT_METIER_ID
) AND ORGANISATION_ID IN (
SELECT ORGANISATION_ID
FROM INTRA_ORGANISATION
START WITH ORGANISATION_ID='025ee58f-35a3-4183-8679-01472838f753'
CONNECT BY ORGANISATION_ID= PRIOR PARENT_ORGANISATION_ID
);
http://oracle.com
Oracle database uses CONNECT BY to generate EXPLAINs.
http://practical-sql-tuning.blogspot.com/2009/01/use-of-statistically-incor…
select sum(human_cnt) from facts
where territory_id in (select territory_id
from dic$territory
start with territory_code = :code
connect by prior territory_id = territory_parent);
http://www.dbasupport.com/forums/archive/index.php/t-30008.html
SELECT LEVEL,LPAD(' ',8*(LEVEL-1))||T_COM_OBJ.OBJ_NAME, T_COM_OBJ.OBJ_PARENT,
T_COM_OBJ.OBJ_ID
FROM VDR.T_COM_OBJ
START WITH T_COM_OBJ.OBJ_ID in (select obj_id obj_main from vdr.t_com_obj
where obj_id=obj_parent)
CONNECT BY PRIOR T_COM_OBJ.OBJ_ID = T_COM_OBJ.OBJ_PARENT
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#10 Updated (by Sergei): Microseconds
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Microseconds
CREATION DATE..: Thu, 26 Mar 2009, 00:29
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-BackLog
TASK ID........: 10 (http://askmonty.org/worklog/?tid=10)
VERSION........: Server-5.3
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.32058 2010-06-29 14:03:11.000000000 +0000
+++ /tmp/wklog.10.new.32058 2010-06-29 14:03:11.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-BackLog
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.31970 2010-06-29 14:03:01.000000000 +0000
+++ /tmp/wklog.10.new.31970 2010-06-29 14:03:01.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Monty - Fri, 29 Jan 2010, 19:05)=-=-
Version updated.
--- /tmp/wklog.10.old.5698 2010-01-29 19:05:42.000000000 +0200
+++ /tmp/wklog.10.new.5698 2010-01-29 19:05:42.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
DESCRIPTION:
Add microsecond precision to NOW()
Add new field types for time and datetime with microsecond precision
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#10 Updated (by Sergei): Microseconds
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Microseconds
CREATION DATE..: Thu, 26 Mar 2009, 00:29
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 10 (http://askmonty.org/worklog/?tid=10)
VERSION........: Server-5.3
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:03)=-=-
Category updated.
--- /tmp/wklog.10.old.31970 2010-06-29 14:03:01.000000000 +0000
+++ /tmp/wklog.10.new.31970 2010-06-29 14:03:01.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Monty - Fri, 29 Jan 2010, 19:05)=-=-
Version updated.
--- /tmp/wklog.10.old.5698 2010-01-29 19:05:42.000000000 +0200
+++ /tmp/wklog.10.new.5698 2010-01-29 19:05:42.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
DESCRIPTION:
Add microsecond precision to NOW()
Add new field types for time and datetime with microsecond precision
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#24 Updated (by Sergei): index_merge: fair choice between index_merge union and range access
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: fair choice between index_merge union and range access
CREATION DATE..: Tue, 26 May 2009, 12:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 24 (http://askmonty.org/worklog/?tid=24)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:00)=-=-
Category updated.
--- /tmp/wklog.24.old.31772 2010-06-29 14:00:05.000000000 +0000
+++ /tmp/wklog.24.new.31772 2010-06-29 14:00:05.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Guest - Sun, 16 Aug 2009, 02:13)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.23383 2009-08-16 02:13:54.000000000 +0300
+++ /tmp/wklog.24.new.23383 2009-08-16 02:13:54.000000000 +0300
@@ -125,7 +125,7 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
-(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
@@ -199,7 +199,7 @@
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
- non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create index_merge made for index scans that didn't have
their match in the other sel_tree.
-=-=(Guest - Sun, 16 Aug 2009, 01:03)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.20767 2009-08-16 01:03:11.000000000 +0300
+++ /tmp/wklog.24.new.20767 2009-08-16 01:03:11.000000000 +0300
@@ -18,6 +18,8 @@
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+ (here range(keyi) may represent ranges not for initial keyi prefixes,
+ but ranges for any infixes for keyi)
# merge tree represents several way to index_merge
imerge_tree = imerge1 AND imerge2 AND ...
@@ -47,13 +49,13 @@
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
- return R;
+ return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
- remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from A;
remove non-ranges from B;
- return new index_merge(A, B);
+ return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
@@ -65,12 +67,12 @@
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
- (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
- (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)
Now each line represents an index_merge..
}
@@ -82,18 +84,18 @@
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
- -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-4
imergeA1
OR
- imergeB1 AND imergeB2 AND ... AND imergeBN =
+ imergeB1 =
- = (combine imergeA1 with each of the imergeB{i} ) =
+ = (combine imergeA1 with each of the range_treeB_1{i} ) =
- combine(imergeA1 OR imergeB1) AND
- combine(imergeA1 OR imergeB2) AND
+ combine(imergeA1 OR range_treeB_11) AND
+ combine(imergeA1 OR range_treeB_12) AND
... AND
- combine(imergeA1 OR imergeBN)
+ combine(imergeA1 OR range_treeB_1N)
}
}
@@ -109,7 +111,7 @@
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions t.badkey may have abritrary form):
- (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+ (t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
@@ -123,6 +125,8 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
+(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+col3=c3 represent index ranges.)
2. New implementation
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 24
-=-=(Guest - Sat, 20 Jun 2009, 09:34)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.21663 2009-06-20 09:34:48.000000000 +0300
+++ /tmp/wklog.24.new.21663 2009-06-20 09:34:48.000000000 +0300
@@ -4,6 +4,7 @@
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
+3. Testing and required coverage
</contents>
1. Current implementation overview
@@ -240,3 +241,14 @@
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
+
+3. Testing and required coverage
+================================
+So far could find the following user cases:
+
+* BUG#17259: Query optimizer chooses wrong index
+* BUG#17673: Optimizer does not use Index Merge optimization in some cases
+* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
+* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
+
+
-=-=(Guest - Thu, 18 Jun 2009, 16:55)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.19152 2009-06-18 16:55:00.000000000 +0300
+++ /tmp/wklog.24.new.19152 2009-06-18 16:55:00.000000000 +0300
@@ -141,13 +141,15 @@
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
+
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
-1. Don't remove index_merge part of the tree.
+A1. Don't remove index_merge part of the tree (this will take care of
+ DISCARD-IMERGE-1 problem)
-2. Push range conditions down into index_merge trees that may support them.
+A2. Push range conditions down into index_merge trees that may support them.
if one tree has range(key1) and the other tree has imerge(key1 OR key2)
then perform an equvalent of this operation:
@@ -155,8 +157,86 @@
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
-3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
-2.2 New tree_or()
+2.2 New tree_or()
+-----------------
+O1. Dont remove non-range plans:
+ Current tree_or() code will refuse to produce index_merge plans for
+ conditions like
+
+ "t.key1part2=const OR t.key2part1=const"
+
+ (this is marked as DISCARD-IMERGE-3). This was justifed as the left part of
+ the AND condition is not usable for range access, and the operation of
+ tree_and() guaranteed that there was no way it could changed to make a
+ usable range plan. With new tree_and() and rule A2, this is no longer the
+ case. For example for this query:
+
+ (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
+
+ it will construct a
+
+ imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
+
+ then tree_and() will apply rule A2 to push the range down into index merge
+ and after that we'll have:
+
+ range(t.key1part1=const)
+ imerge(
+ t.key1part2=const AND t.key1part1=const,
+ t.key2part1=const
+ )
+ note that imerge(...) describes a usable index_merge plan and it's possible
+ that it will be the best access path.
+
+O2. "Create index_merge accesses when possible"
+ Current tree_or() will not create index_merge access when it could create
+ non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ in the current implementation" section). This will be changed to work as
+ follows: we will create index_merge made for index scans that didn't have
+ their match in the other sel_tree.
+ Ilustrating it with an example:
+
+ | sel_tree_A | sel_tree_B | A or B | include in index_merge?
+ ------+------------+------------+--------+------------------------
+ key1 | cond1 | cond2 | condM | no
+ key2 | cond3 | cond4 | NULL | no
+ key3 | cond5 | | | yes, A-side
+ key4 | cond6 | | | yes, A-side
+ key5 | | cond7 | | yes, B-side
+ key6 | | cond8 | | yes, B-side
+
+ here we assume that
+ - (cond1 OR cond2) did produce a combined range. Not including them in
+ index_merge.
+ - (cond3 OR cond4) didn't produce a usable range (e.g. they were
+ t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
+ didn't yield any range list)
+ - All other scand didn't have their counterparts, so we'll end up with a
+ SEL_TREE of:
+
+ range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
+ .
+
+O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
+that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven
+seen any complaints that could be attributed to it.
+If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
+lift it ,and produce a cross-product:
+
+ ((key1p OR key2p) AND (key3p OR key4p))
+ OR
+ ((key5p OR key6p) AND (key7p OR key8p))
+
+ = (key1p OR key2p OR key5p OR key6p) AND // this part is currently
+ (key3p OR key4p OR key5p OR key6p) AND // produced
+
+ (key1p OR key2p OR key5p OR key6p) AND // this part will be added
+ (key3p OR key4p OR key5p OR key6p) //.
+
+In order to limit the impact of this combinatorial explosion, we will
+introduce a rule that we won't generate more than #defined
+MAX_IMERGE_OPTS options.
-=-=(Guest - Thu, 18 Jun 2009, 14:56)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.15612 2009-06-18 14:56:09.000000000 +0300
+++ /tmp/wklog.24.new.15612 2009-06-18 14:56:09.000000000 +0300
@@ -1 +1,162 @@
+<contents>
+1. Current implementation overview
+1.1. Problems in the current implementation
+2. New implementation
+2.1 New tree_and()
+2.2 New tree_or()
+</contents>
+
+1. Current implementation overview
+==================================
+At the moment, range analyzer works as follows:
+
+SEL_TREE structure represents
+
+ # There are sel_trees, a sel_tree is either range or merge tree
+ sel_tree = range_tree | imerge_tree
+
+ # a range tree has range access options, possibly for several keys
+ range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+
+ # merge tree represents several way to index_merge
+ imerge_tree = imerge1 AND imerge2 AND ...
+
+ # a way to do index merge == a set to use of different indexes.
+ imergeX = range_tree1 OR range_tree2 OR ..
+ where no pair of range_treeX have ranges over the same index.
+
+
+ tree_and(A, B)
+ {
+ if (both A and B are range trees)
+ return a range_tree with computed intersection for each range;
+ if (only one of A and B is a range tree)
+ return that tree; // DISCARD-IMERGE-1
+ // at this point both trees are index_merge trees
+ return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
+ }
+
+
+ tree_or(A, B)
+ {
+ if (A and B are range trees)
+ {
+ R = new range_tree;
+ for each index i
+ R.add(range_union(A.range(i), B.range(i)));
+
+ if (R has at least one range access)
+ return R;
+ else
+ {
+ /* could not build any range accesses. construct index_merge */
+ remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from B;
+ return new index_merge(A, B);
+ }
+ }
+ else if (A is range tree and B is index_merge tree (or vice versa))
+ {
+ Perform this transformation:
+
+ range_treeA // this is A
+ OR
+ (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
+ (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ =
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+
+ Now each line represents an index_merge..
+ }
+ else if (both A and B are index_merge trees)
+ {
+ Perform this transformation:
+
+ imergeA1 AND imergeA2 AND ... AND imergeAN
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN
+
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+
+ imergeA1
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN =
+
+ = (combine imergeA1 with each of the imergeB{i} ) =
+
+ combine(imergeA1 OR imergeB1) AND
+ combine(imergeA1 OR imergeB2) AND
+ ... AND
+ combine(imergeA1 OR imergeBN)
+ }
+ }
+
+1.1. Problems in the current implementation
+-------------------------------------------
+As marked in the code above:
+
+DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
+the WHERE clause has this form:
+
+ (t.key1=c1 OR t.key2=c2) AND t.badkey < c3
+
+DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
+the WHERE clause has this form (conditions t.badkey may have abritrary form):
+
+ (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+
+DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
+two indexes:
+
+ INDEX i1(col1, col2),
+ INDEX i2(col1, col3)
+
+and this WHERE clause:
+
+ col1=c1 AND (col2=c2 OR col3=c3)
+
+The optimizer will generate the plans that only use the "col1=c1" part. The
+right side of the AND will be ignored even if it has good selectivity.
+
+
+2. New implementation
+=====================
+
+<general idea>
+* Don't start fighting combinatorial explosion until we've actually got one.
+</>
+
+SEL_TREE structure will be now able to hold both index_merge and range scan
+candidates at the same time. That is,
+
+ sel_tree2 = range_tree AND imerge_tree
+
+where both parts are optional (i.e. can be empty)
+
+Operations on SEL_ARG trees will be modified to produce/process the trees of
+this kind:
+
+2.1 New tree_and()
+------------------
+In order not to lose plans, we'll make these changes:
+
+1. Don't remove index_merge part of the tree.
+
+2. Push range conditions down into index_merge trees that may support them.
+ if one tree has range(key1) and the other tree has imerge(key1 OR key2)
+ then perform an equvalent of this operation:
+
+ rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
+
+ (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
+
+3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+ concatenate them together.
+
+2.2 New tree_or()
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 24
-=-=(Guest - Mon, 01 Jun 2009, 23:30)=-=-
High-Level Specification modified.
--- /tmp/wklog.24.old.21580 2009-06-01 23:30:06.000000000 +0300
+++ /tmp/wklog.24.new.21580 2009-06-01 23:30:06.000000000 +0300
@@ -64,6 +64,9 @@
* How strict is the limitation on the form of the WHERE?
+* Which version should this be based on? 5.1? Which patches are should be in
+ (google's/percona's/maria/etc?)
+
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
-=-=(Guest - Wed, 27 May 2009, 13:59)=-=-
Title modified.
--- /tmp/wklog.24.old.9498 2009-05-27 13:59:23.000000000 +0300
+++ /tmp/wklog.24.new.9498 2009-05-27 13:59:23.000000000 +0300
@@ -1 +1 @@
-index_merge optimizer: dont discard index_merge union strategies when range is available
+index_merge: fair choice between index_merge union and range access
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=24&nolimit=1
DESCRIPTION:
Current range optimizer will discard possible index_merge/[sort]union
strategies when there is a possible range plan. This action is a part of
measures we take to avoid combinatorial explosion of possible range/
index_merge strategies.
A bad side effect of this is that for WHERE clauses of the form

  t.key1= 'very-frequent-value' AND (t.key2='rare-value1' OR t.key3='rare-value2')

the optimizer will
- discard union(key2,key3) in favor of range(key1)
- consider costs of using range(key1) and discard that plan also
and the overall effect is that a possibly poor range access will cause a
possibly good index_merge access not to be considered.
This WL is about lifting this limitation, at least for some subset of WHERE
clauses.
HIGH-LEVEL SPECIFICATION:
(Not a finished HLS, but a draft)
<contents>
Solution overview
Limitations
TODO
</contents>
Solution overview
=================
The idea is to delay discarding potential index_merge plans until the point
where it is really necessary.
This way, we won't have to make many changes in the range analyzer, but will
be able to keep potential index_merge plans around just enough that it's
possible to take them into consideration together with range access plans.
Since there are no changes in the optimizer, the ability to consider both
range and index_merge options will be limited to WHERE clauses of this form:
  WHERE := range_cond(key1_1) AND
           range_cond(key2_1) AND
           other_cond AND
           index_merge_OR_cond1(key3_1, key3_2, ...)
           index_merge_OR_cond2(key4_1, key4_2, ...)

where

  index_merge_OR_cond{N} := (range_cond(keyN_1) OR
                             range_cond(keyN_2) OR ...)

  range_cond(keyX) := a condition that allows one to construct range access of
                      keyX and doesn't allow one to construct range/index_merge
                      accesses for any other keys of the table in question.
For such WHERE clauses, the range analyzer will produce a SEL_TREE of this
form:

  SEL_TREE(
    range(key1_1),
    ...
    range(key2_1),
    SEL_IMERGE(                       (1)
      SEL_TREE(key3_1)
      SEL_TREE(key3_2)
      ...
    )
    ...
  )
which can be used to make a cost-based choice between range and index_merge.
Limitations
-----------
This will not be a full solution in the sense that the range analyzer will not
be able to produce sel_tree (1) if the WHERE clause is specified in another
form (e.g. if the brackets were opened up).
TODO
----
* is it a problem if there are keys that are referred to both from
index_merge and from range access?
* How strict is the limitation on the form of the WHERE?
* Which version should this be based on? 5.1? Which patches should be in
  (google's/percona's/maria/etc.)?
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
LOW-LEVEL DESIGN:
<contents>
1. Current implementation overview
1.1. Problems in the current implementation
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
3. Testing and required coverage
</contents>
1. Current implementation overview
==================================
At the moment, range analyzer works as follows:
SEL_TREE structure represents
  # There are sel_trees, a sel_tree is either range or merge tree
  sel_tree = range_tree | imerge_tree

  # a range tree has range access options, possibly for several keys
  range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
    (here range(keyi) may represent ranges not for initial keyi prefixes,
     but ranges for any infixes for keyi)

  # a merge tree represents several ways to index_merge
  imerge_tree = imerge1 AND imerge2 AND ...

  # a way to do index merge == a set of different indexes to use.
  imergeX = range_tree1 OR range_tree2 OR ..
    where no pair of range_treeX have ranges over the same index.


  tree_and(A, B)
  {
    if (both A and B are range trees)
      return a range_tree with computed intersection for each range;
    if (only one of A and B is a range tree)
      return that tree; // DISCARD-IMERGE-1
    // at this point both trees are index_merge trees
    return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
  }


  tree_or(A, B)
  {
    if (A and B are range trees)
    {
      R = new range_tree;
      for each index i
        R.add(range_union(A.range(i), B.range(i)));

      if (R has at least one range access)
        return R; // DISCARD-IMERGE-2
      else
      {
        /* could not build any range accesses. construct index_merge */
        remove non-ranges from A;
        remove non-ranges from B;
        return new index_merge(A, B); // DISCARD-IMERGE-3
      }
    }
    else if (A is range tree and B is index_merge tree (or vice versa))
    {
      Perform this transformation:

      range_treeA  // this is A
      OR
      (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
      (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
      ...
      (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
      =
      (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
      (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
      ...
      (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)

      Now each line represents an index_merge..
    }
    else if (both A and B are index_merge trees)
    {
      Perform this transformation:

      imergeA1 AND imergeA2 AND ... AND imergeAN
      OR
      imergeB1 AND imergeB2 AND ... AND imergeBN

      -> (discard all imergeA{i=2,3,...}) ->  // DISCARD-IMERGE-4

      imergeA1
      OR
      imergeB1 =

      = (combine imergeA1 with each of the range_treeB_1{i} ) =

      combine(imergeA1 OR range_treeB_11) AND
      combine(imergeA1 OR range_treeB_12) AND
      ... AND
      combine(imergeA1 OR range_treeB_1N)
    }
  }
1.1. Problems in the current implementation
-------------------------------------------
As marked in the code above:
DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
the WHERE clause has this form:
(t.key1=c1 OR t.key2=c2) AND t.badkey < c3
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions on t.badkey may have an arbitrary form):
(t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
INDEX i1(col1, col2),
INDEX i2(col1, col3)
and this WHERE clause:
col1=c1 AND (col2=c2 OR col3=c3)
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
2. New implementation
=====================
<general idea>
* Don't start fighting combinatorial explosion until we've actually got one.
</>
SEL_TREE structure will now be able to hold both index_merge and range scan
candidates at the same time. That is,
sel_tree2 = range_tree AND imerge_tree
where both parts are optional (i.e. can be empty)
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
A1. Don't remove index_merge part of the tree (this will take care of
DISCARD-IMERGE-1 problem)
A2. Push range conditions down into index_merge trees that may support them.
    If one tree has range(key1) and the other tree has imerge(key1 OR key2)
    then perform an equivalent of this operation (a toy sketch follows this
    list):

      rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =

      (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
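A toy C++ illustration of rule A2's AND-over-OR distribution; the strings stand
in for SEL_ARG subtrees, while the real code would operate on the server's
SEL_TREE/SEL_IMERGE structures:

    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main()
    {
      // rangeA(key1) from the range part of one sel_tree ...
      std::string range= "rangeA(key1)";
      // ... and an imerge with one alternative per index from the other tree.
      std::vector<std::string> imerge;
      imerge.push_back("rangeB(key1)");
      imerge.push_back("rangeB(key2)");

      // Rule A2: distribute the range condition over the OR-ed alternatives.
      std::string pushed;
      for (std::size_t i= 0; i < imerge.size(); i++)
      {
        if (i)
          pushed+= " OR ";
        pushed+= "(" + range + " AND " + imerge[i] + ")";
      }

      // prints: (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
      std::printf("%s\n", pushed.c_str());
      return 0;
    }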
2.2 New tree_or()
-----------------
O1. Don't remove non-range plans:
    Current tree_or() code will refuse to produce index_merge plans for
    conditions like

      "t.key1part2=const OR t.key2part1=const"

    (this is marked as DISCARD-IMERGE-3). This was justified as the left part
    of the AND condition is not usable for range access, and the operation of
    tree_and() guaranteed that there was no way it could be changed to make a
    usable range plan. With the new tree_and() and rule A2, this is no longer
    the case. For example, for this query:

      (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const

    it will construct

      imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)

    then tree_and() will apply rule A2 to push the range down into the index
    merge, and after that we'll have:

      range(t.key1part1=const)
      imerge(
        t.key1part2=const AND t.key1part1=const,
        t.key2part1=const
      )

    Note that imerge(...) describes a usable index_merge plan and it's
    possible that it will be the best access path.
O2. "Create index_merge accesses when possible"
    Current tree_or() will not create an index_merge access when it could
    create a non-index_merge access (see DISCARD-IMERGE-2 and its example in
    the "Problems in the current implementation" section). This will be
    changed to work as follows: we will create an index_merge made of the
    index scans that didn't have their match in the other sel_tree.
    Illustrating it with an example:
          | sel_tree_A | sel_tree_B | A or B | include in index_merge?
    ------+------------+------------+--------+------------------------
    key1  | cond1      | cond2      | condM  | no
    key2  | cond3      | cond4      | NULL   | no
    key3  | cond5      |            |        | yes, A-side
    key4  | cond6      |            |        | yes, A-side
    key5  |            | cond7      |        | yes, B-side
    key6  |            | cond8      |        | yes, B-side
    here we assume that
    - (cond1 OR cond2) did produce a combined range. Not including them in
      the index_merge.
    - (cond3 OR cond4) didn't produce a usable range (e.g. they were
      t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
      didn't yield any range list)
    - All other scans didn't have their counterparts, so we'll end up with a
      SEL_TREE of:

        range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
    that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven't
    seen any complaints that could be attributed to it.
    If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
    lift it, and produce a cross-product:
      ((key1p OR key2p) AND (key3p OR key4p))
      OR
      ((key5p OR key6p) AND (key7p OR key8p))

      = (key1p OR key2p OR key5p OR key6p) AND   // this part is currently
        (key3p OR key4p OR key5p OR key6p) AND   // produced

        (key1p OR key2p OR key5p OR key6p) AND   // this part will be added
        (key3p OR key4p OR key5p OR key6p)       //.
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
3. Testing and required coverage
================================
So far we could find the following use cases:
* BUG#17259: Query optimizer chooses wrong index
* BUG#17673: Optimizer does not use Index Merge optimization in some cases
* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#24 Updated (by Sergei): index_merge: fair choice between index_merge union and range access
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: fair choice between index_merge union and range access
CREATION DATE..: Tue, 26 May 2009, 12:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 24 (http://askmonty.org/worklog/?tid=24)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 14:00)=-=-
Category updated.
--- /tmp/wklog.24.old.31772 2010-06-29 14:00:05.000000000 +0000
+++ /tmp/wklog.24.new.31772 2010-06-29 14:00:05.000000000 +0000
@@ -1 +1 @@
-Server-Sprint
+Server-RawIdeaBin
-=-=(Guest - Sun, 16 Aug 2009, 02:13)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.23383 2009-08-16 02:13:54.000000000 +0300
+++ /tmp/wklog.24.new.23383 2009-08-16 02:13:54.000000000 +0300
@@ -125,7 +125,7 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
-(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
@@ -199,7 +199,7 @@
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
- non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create index_merge made for index scans that didn't have
their match in the other sel_tree.
-=-=(Guest - Sun, 16 Aug 2009, 01:03)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.20767 2009-08-16 01:03:11.000000000 +0300
+++ /tmp/wklog.24.new.20767 2009-08-16 01:03:11.000000000 +0300
@@ -18,6 +18,8 @@
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+ (here range(keyi) may represent ranges not for initial keyi prefixes,
+ but ranges for any infixes for keyi)
# merge tree represents several way to index_merge
imerge_tree = imerge1 AND imerge2 AND ...
@@ -47,13 +49,13 @@
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
- return R;
+ return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
- remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from A;
remove non-ranges from B;
- return new index_merge(A, B);
+ return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
@@ -65,12 +67,12 @@
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
- (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
- (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N)
Now each line represents an index_merge..
}
@@ -82,18 +84,18 @@
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
- -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-4
imergeA1
OR
- imergeB1 AND imergeB2 AND ... AND imergeBN =
+ imergeB1 =
- = (combine imergeA1 with each of the imergeB{i} ) =
+ = (combine imergeA1 with each of the range_treeB_1{i} ) =
- combine(imergeA1 OR imergeB1) AND
- combine(imergeA1 OR imergeB2) AND
+ combine(imergeA1 OR range_treeB_11) AND
+ combine(imergeA1 OR range_treeB_12) AND
... AND
- combine(imergeA1 OR imergeBN)
+ combine(imergeA1 OR range_treeB_1N)
}
}
@@ -109,7 +111,7 @@
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (conditions t.badkey may have abritrary form):
- (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+ (t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
@@ -123,6 +125,8 @@
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
+(Here an imerge for col2=c2 OR col3=c3 won't be built since neither col2=c2 nor
+col3=c3 represent index ranges.)
2. New implementation
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 24
-=-=(Guest - Sat, 20 Jun 2009, 09:34)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.21663 2009-06-20 09:34:48.000000000 +0300
+++ /tmp/wklog.24.new.21663 2009-06-20 09:34:48.000000000 +0300
@@ -4,6 +4,7 @@
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
+3. Testing and required coverage
</contents>
1. Current implementation overview
@@ -240,3 +241,14 @@
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
+
+3. Testing and required coverage
+================================
+So far could find the following user cases:
+
+* BUG#17259: Query optimizer chooses wrong index
+* BUG#17673: Optimizer does not use Index Merge optimization in some cases
+* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
+* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
+
+
-=-=(Guest - Thu, 18 Jun 2009, 16:55)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.19152 2009-06-18 16:55:00.000000000 +0300
+++ /tmp/wklog.24.new.19152 2009-06-18 16:55:00.000000000 +0300
@@ -141,13 +141,15 @@
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
+
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes:
-1. Don't remove index_merge part of the tree.
+A1. Don't remove index_merge part of the tree (this will take care of
+ DISCARD-IMERGE-1 problem)
-2. Push range conditions down into index_merge trees that may support them.
+A2. Push range conditions down into index_merge trees that may support them.
if one tree has range(key1) and the other tree has imerge(key1 OR key2)
then perform an equvalent of this operation:
@@ -155,8 +157,86 @@
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
-3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
-2.2 New tree_or()
+2.2 New tree_or()
+-----------------
+O1. Dont remove non-range plans:
+ Current tree_or() code will refuse to produce index_merge plans for
+ conditions like
+
+ "t.key1part2=const OR t.key2part1=const"
+
+ (this is marked as DISCARD-IMERGE-3). This was justifed as the left part of
+ the AND condition is not usable for range access, and the operation of
+ tree_and() guaranteed that there was no way it could changed to make a
+ usable range plan. With new tree_and() and rule A2, this is no longer the
+ case. For example for this query:
+
+ (t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
+
+ it will construct a
+
+ imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
+
+ then tree_and() will apply rule A2 to push the range down into index merge
+ and after that we'll have:
+
+ range(t.key1part1=const)
+ imerge(
+ t.key1part2=const AND t.key1part1=const,
+ t.key2part1=const
+ )
+ note that imerge(...) describes a usable index_merge plan and it's possible
+ that it will be the best access path.
+
+O2. "Create index_merge accesses when possible"
+ Current tree_or() will not create index_merge access when it could create
+ non-index merge access (see DISCARD-IMERGE-3 and its example in the "Problems
+ in the current implementation" section). This will be changed to work as
+ follows: we will create index_merge made for index scans that didn't have
+ their match in the other sel_tree.
+ Ilustrating it with an example:
+
+ | sel_tree_A | sel_tree_B | A or B | include in index_merge?
+ ------+------------+------------+--------+------------------------
+ key1 | cond1 | cond2 | condM | no
+ key2 | cond3 | cond4 | NULL | no
+ key3 | cond5 | | | yes, A-side
+ key4 | cond6 | | | yes, A-side
+ key5 | | cond7 | | yes, B-side
+ key6 | | cond8 | | yes, B-side
+
+ here we assume that
+ - (cond1 OR cond2) did produce a combined range. Not including them in
+ index_merge.
+ - (cond3 OR cond4) didn't produce a usable range (e.g. they were
+ t.key1part1=c1 AND t.key1part2=c1, respectively, and combining them
+ didn't yield any range list)
+ - All other scand didn't have their counterparts, so we'll end up with a
+ SEL_TREE of:
+
+ range(condM) AND index_merge((cond5 AND cond6),(cond7 AND cond8))
+ .
+
+O4. There is no O4. DISCARD-INDEX-MERGE-4 will remain there. The idea is
+that although DISCARD-INDEX-MERGE-4 does discard plans, so far we haven
+seen any complaints that could be attributed to it.
+If we face the need to lift DISCARD-INDEX-MERGE-4, our answer will be to
+lift it ,and produce a cross-product:
+
+ ((key1p OR key2p) AND (key3p OR key4p))
+ OR
+ ((key5p OR key6p) AND (key7p OR key8p))
+
+ = (key1p OR key2p OR key5p OR key6p) AND // this part is currently
+ (key3p OR key4p OR key5p OR key6p) AND // produced
+
+ (key1p OR key2p OR key5p OR key6p) AND // this part will be added
+ (key3p OR key4p OR key5p OR key6p) //.
+
+In order to limit the impact of this combinatorial explosion, we will
+introduce a rule that we won't generate more than #defined
+MAX_IMERGE_OPTS options.
-=-=(Guest - Thu, 18 Jun 2009, 14:56)=-=-
Low Level Design modified.
--- /tmp/wklog.24.old.15612 2009-06-18 14:56:09.000000000 +0300
+++ /tmp/wklog.24.new.15612 2009-06-18 14:56:09.000000000 +0300
@@ -1 +1,162 @@
+<contents>
+1. Current implementation overview
+1.1. Problems in the current implementation
+2. New implementation
+2.1 New tree_and()
+2.2 New tree_or()
+</contents>
+
+1. Current implementation overview
+==================================
+At the moment, range analyzer works as follows:
+
+SEL_TREE structure represents
+
+ # There are sel_trees, a sel_tree is either range or merge tree
+ sel_tree = range_tree | imerge_tree
+
+ # a range tree has range access options, possibly for several keys
+ range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
+
+ # merge tree represents several way to index_merge
+ imerge_tree = imerge1 AND imerge2 AND ...
+
+ # a way to do index merge == a set to use of different indexes.
+ imergeX = range_tree1 OR range_tree2 OR ..
+ where no pair of range_treeX have ranges over the same index.
+
+
+ tree_and(A, B)
+ {
+ if (both A and B are range trees)
+ return a range_tree with computed intersection for each range;
+ if (only one of A and B is a range tree)
+ return that tree; // DISCARD-IMERGE-1
+ // at this point both trees are index_merge trees
+ return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
+ }
+
+
+ tree_or(A, B)
+ {
+ if (A and B are range trees)
+ {
+ R = new range_tree;
+ for each index i
+ R.add(range_union(A.range(i), B.range(i)));
+
+ if (R has at least one range access)
+ return R;
+ else
+ {
+ /* could not build any range accesses. construct index_merge */
+ remove non-ranges from A; // DISCARD-IMERGE-2
+ remove non-ranges from B;
+ return new index_merge(A, B);
+ }
+ }
+ else if (A is range tree and B is index_merge tree (or vice versa))
+ {
+ Perform this transformation:
+
+ range_treeA // this is A
+ OR
+ (range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
+ (range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_kN) AND
+ =
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+ (range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
+ ...
+ (range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
+
+ Now each line represents an index_merge..
+ }
+ else if (both A and B are index_merge trees)
+ {
+ Perform this transformation:
+
+ imergeA1 AND imergeA2 AND ... AND imergeAN
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN
+
+ -> (discard all imergeA{i=2,3,...} -> // DISCARD-IMERGE-3
+
+ imergeA1
+ OR
+ imergeB1 AND imergeB2 AND ... AND imergeBN =
+
+ = (combine imergeA1 with each of the imergeB{i} ) =
+
+ combine(imergeA1 OR imergeB1) AND
+ combine(imergeA1 OR imergeB2) AND
+ ... AND
+ combine(imergeA1 OR imergeBN)
+ }
+ }
+
+1.1. Problems in the current implementation
+-------------------------------------------
+As marked in the code above:
+
+DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
+the WHERE clause has this form:
+
+ (t.key1=c1 OR t.key2=c2) AND t.badkey < c3
+
+DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
+the WHERE clause has this form (conditions t.badkey may have abritrary form):
+
+ (t.badkey<c1 AND t.key1=c1) OR (t.key1=c2 AND t.badkey < c2)
+
+DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
+two indexes:
+
+ INDEX i1(col1, col2),
+ INDEX i2(col1, col3)
+
+and this WHERE clause:
+
+ col1=c1 AND (col2=c2 OR col3=c3)
+
+The optimizer will generate the plans that only use the "col1=c1" part. The
+right side of the AND will be ignored even if it has good selectivity.
+
+
+2. New implementation
+=====================
+
+<general idea>
+* Don't start fighting combinatorial explosion until we've actually got one.
+</>
+
+SEL_TREE structure will be now able to hold both index_merge and range scan
+candidates at the same time. That is,
+
+ sel_tree2 = range_tree AND imerge_tree
+
+where both parts are optional (i.e. can be empty)
+
+Operations on SEL_ARG trees will be modified to produce/process the trees of
+this kind:
+
+2.1 New tree_and()
+------------------
+In order not to lose plans, we'll make these changes:
+
+1. Don't remove index_merge part of the tree.
+
+2. Push range conditions down into index_merge trees that may support them.
+ if one tree has range(key1) and the other tree has imerge(key1 OR key2)
+ then perform an equvalent of this operation:
+
+ rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
+
+ (rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
+
+3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
+ concatenate them together.
+
+2.2 New tree_or()
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 24
-=-=(Guest - Mon, 01 Jun 2009, 23:30)=-=-
High-Level Specification modified.
--- /tmp/wklog.24.old.21580 2009-06-01 23:30:06.000000000 +0300
+++ /tmp/wklog.24.new.21580 2009-06-01 23:30:06.000000000 +0300
@@ -64,6 +64,9 @@
* How strict is the limitation on the form of the WHERE?
+* Which version should this be based on? 5.1? Which patches are should be in
+ (google's/percona's/maria/etc?)
+
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
-=-=(Guest - Wed, 27 May 2009, 13:59)=-=-
Title modified.
--- /tmp/wklog.24.old.9498 2009-05-27 13:59:23.000000000 +0300
+++ /tmp/wklog.24.new.9498 2009-05-27 13:59:23.000000000 +0300
@@ -1 +1 @@
-index_merge optimizer: dont discard index_merge union strategies when range is available
+index_merge: fair choice between index_merge union and range access
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=24&nolimit=1
DESCRIPTION:
The current range optimizer will discard possible index_merge/[sort]union
strategies when there is a possible range plan. This is part of the
measures we take to avoid a combinatorial explosion of possible range/
index_merge strategies.
A bad side effect of this is that for WHERE clauses of the form
t.key1= 'very-frequent-value' AND (t.key2='rare-value1' OR t.key3='rare-value2')
the optimizer will
- discard union(key2,key3) in favor of range(key1)
- consider costs of using range(key1) and discard that plan also
and the overall effect is that a possibly poor range access will cause a
possibly good index_merge access not to be considered.
This WL is about lifting this limitation, at least for some subset of WHERE
clauses.
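For illustration, here is a hypothetical table and query that hit this
limitation (all table, column and constant names below are made up for the
example; they are not taken from an actual test case):
  CREATE TABLE t (key1 INT, key2 INT, key3 INT, filler CHAR(200),
                  INDEX (key1), INDEX (key2), INDEX (key3));
  -- key1=1 matches a large fraction of the table; key2=10 and key3=20
  -- each match only a few rows
  EXPLAIN SELECT * FROM t
  WHERE t.key1 = 1 AND (t.key2 = 10 OR t.key3 = 20);
Today union(key2,key3) is discarded early, so only range(key1) (or a full
scan) is costed; after this WL the union should be costed as well.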
HIGH-LEVEL SPECIFICATION:
(Not a finished HLS, but a draft)
<contents>
Solution overview
Limitations
TODO
</contents>
Solution overview
=================
The idea is to delay discarding potential index_merge plans until the point
where it is really necessary.
This way, we won't have to make many changes in the range analyzer, but we will
be able to keep the potential index_merge plan around just long enough that it
can be considered together with the range access plans.
Since there are no changes in the optimizer, the ability to consider both
range and index_merge options will be limited to WHERE clauses of this form:
WHERE := range_cond(key1_1) AND
range_cond(key2_1) AND
other_cond AND
index_merge_OR_cond1(key3_1, key3_2, ...) AND
index_merge_OR_cond2(key4_1, key4_2, ...)
where
index_merge_OR_cond{N} := (range_cond(keyN_1) OR
range_cond(keyN_2) OR ...)
range_cond(keyX) := a condition that allows one to construct range access over keyX
and doesn't allow one to construct range/index_merge accesses
over any other keys of the table in question.
For such WHERE clauses, the range analyzer will produce a SEL_TREE of this form:
SEL_TREE(
range(key1_1),
...
range(key2_1),
SEL_IMERGE( (1)
SEL_TREE(key3_1)
SEL_TREE(key3_2)
...
)
...
)
which can be used to make a cost-based choice between range and index_merge.
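For illustration (the key names here are hypothetical), a clause of the allowed
form such as
  t.keyA=1 AND (t.keyB=2 OR t.keyC=3)
would be kept by the range analyzer as
  SEL_TREE(
    range(keyA),
    SEL_IMERGE(
      SEL_TREE(keyB),
      SEL_TREE(keyC)
    )
  )
so the cost of range(keyA) can later be compared against the cost of the
union over keyB and keyC.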
Limitations
-----------
This will not be a full solution, in the sense that the range analyzer will not
be able to produce sel_tree (1) if the WHERE clause is specified in another form
(e.g. if the brackets have been opened/expanded).
TODO
----
* is it a problem if there are keys that are referred to both from
index_merge and from range access?
* How strict is the limitation on the form of the WHERE?
* Which version should this be based on? 5.1? Which patches should be in
(google's/percona's/maria/etc?)
* TODO: The optimizer didn't compare costs of index_merge and range before (ok
it did but that was done for accesses to different tables). Will there be any
possible gotchas here?
LOW-LEVEL DESIGN:
<contents>
1. Current implementation overview
1.1. Problems in the current implementation
2. New implementation
2.1 New tree_and()
2.2 New tree_or()
3. Testing and required coverage
</contents>
1. Current implementation overview
==================================
At the moment, range analyzer works as follows:
SEL_TREE structure represents
# There are sel_trees; a sel_tree is either a range tree or a merge tree
sel_tree = range_tree | imerge_tree
# a range tree has range access options, possibly for several keys
range_tree = range(key1) AND range(key2) AND ... AND range(keyN);
(here range(keyi) may represent ranges not for initial keyi prefixes,
but ranges for any infixes for keyi)
# a merge tree represents several ways to do index_merge
imerge_tree = imerge1 AND imerge2 AND ...
# a way to do index merge == a set of different indexes to use.
imergeX = range_tree1 OR range_tree2 OR ..
where no pair of range_treeX have ranges over the same index.
tree_and(A, B)
{
if (both A and B are range trees)
return a range_tree with computed intersection for each range;
if (only one of A and B is a range tree)
return that tree; // DISCARD-IMERGE-1
// at this point both trees are index_merge trees
return concat_lists( A.imerge1 ... A.imergeN, B.imerge1 ... B.imergeN);
}
tree_or(A, B)
{
if (A and B are range trees)
{
R = new range_tree;
for each index i
R.add(range_union(A.range(i), B.range(i)));
if (R has at least one range access)
return R; // DISCARD-IMERGE-2
else
{
/* could not build any range accesses. construct index_merge */
remove non-ranges from A;
remove non-ranges from B;
return new index_merge(A, B); // DISCARD-IMERGE-3
}
}
else if (A is range tree and B is index_merge tree (or vice versa))
{
Perform this transformation:
range_treeA // this is A
OR
(range_treeB_11 OR range_treeB_12 OR ... OR range_treeB_1N) AND
(range_treeB_21 OR range_treeB_22 OR ... OR range_treeB_2N) AND
...
(range_treeB_K1 OR range_treeB_K2 OR ... OR range_treeB_KN)
=
(range_treeA OR range_treeB_11 OR ... OR range_treeB_1N) AND
(range_treeA OR range_treeB_21 OR ... OR range_treeB_2N) AND
...
(range_treeA OR range_treeB_K1 OR ... OR range_treeB_KN)
Now each line represents an index_merge..
}
else if (both A and B are index_merge trees)
{
Perform this transformation:
imergeA1 AND imergeA2 AND ... AND imergeAN
OR
imergeB1 AND imergeB2 AND ... AND imergeBN
-> (discard all imergeA{i=2,3,...}) -> // DISCARD-IMERGE-4
imergeA1
OR
imergeB1 =
= (combine imergeA1 with each of the range_treeB_1{i} ) =
combine(imergeA1 OR range_treeB_11) AND
combine(imergeA1 OR range_treeB_12) AND
... AND
combine(imergeA1 OR range_treeB_1N)
}
}
1.1. Problems in the current implementation
-------------------------------------------
As marked in the code above:
DISCARD-IMERGE-1 step will cause index_merge option to be discarded when
the WHERE clause has this form:
(t.key1=c1 OR t.key2=c2) AND t.badkey < c3
DISCARD-IMERGE-2 step will cause index_merge option to be discarded when
the WHERE clause has this form (the t.badkey conditions may have arbitrary form):
(t.badkey<c1 AND t.key1=c1) OR (t.key2=c2 AND t.badkey < c2)
DISCARD-IMERGE-3 manifests itself as the following effect: suppose there are
two indexes:
INDEX i1(col1, col2),
INDEX i2(col1, col3)
and this WHERE clause:
col1=c1 AND (col2=c2 OR col3=c3)
The optimizer will generate the plans that only use the "col1=c1" part. The
right side of the AND will be ignored even if it has good selectivity.
(Here no imerge for col2=c2 OR col3=c3 will be built since neither col2=c2 nor
col3=c3 represent index ranges.)
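The same DISCARD-IMERGE-3 situation expressed as concrete SQL (hypothetical
table; only meant to illustrate the text above):
  CREATE TABLE t (col1 INT, col2 INT, col3 INT,
                  INDEX i1 (col1, col2), INDEX i2 (col1, col3));
  EXPLAIN SELECT * FROM t WHERE col1 = 1 AND (col2 = 2 OR col3 = 3);
Only the col1=1 prefix of i1 (or i2) is used; an index_merge that also
exploits col2/col3 is never constructed.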
2. New implementation
=====================
<general idea>
* Don't start fighting combinatorial explosion until we've actually got one.
</>
The SEL_TREE structure will now be able to hold both index_merge and range scan
candidates at the same time. That is,
sel_tree2 = range_tree AND imerge_tree
where both parts are optional (i.e. either can be empty).
Operations on SEL_ARG trees will be modified to produce/process the trees of
this kind:
2.1 New tree_and()
------------------
In order not to lose plans, we'll make these changes (a combined pseudocode
sketch follows the list of rules):
A1. Don't remove the index_merge part of the tree (this will take care of
the DISCARD-IMERGE-1 problem)
A2. Push range conditions down into index_merge trees that may support them.
If one tree has range(key1) and the other tree has imerge(key1 OR key2),
then perform an equivalent of this operation:
rangeA(key1) AND ( rangeB(key1) OR rangeB(key2)) =
(rangeA(key1) AND rangeB(key1)) OR (rangeA(key1) AND rangeB(key2))
A3. Just as before: if both sel_tree A and sel_tree B have index_merge options,
concatenate them together.
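A rough sketch, in the same pseudocode style as section 1, of how rules
A1-A3 could fit together (this is only an illustration of the intent, not
the actual implementation; helper names like and_range_trees() are made up):
  new_tree_and(A, B) /* A, B are sel_tree2 = range_tree AND imerge_tree */
  {
    R.range_tree = and_range_trees(A.range_tree, B.range_tree); // as before
    R.imerge_tree = concat_lists(A.imerge_tree, B.imerge_tree); // A1 + A3
    // A2: push range conditions into imerges that cover the same keys
    for each imerge IM in R.imerge_tree
      for each range_tree RT in IM
        for each keyX that has a range both in R.range_tree and in RT
          RT.range(keyX) = range_intersect(RT.range(keyX),
                                           R.range_tree.range(keyX));
    return R;
  }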
2.2 New tree_or()
-----------------
O1. Don't remove non-range plans:
Current tree_or() code will refuse to produce index_merge plans for
conditions like
"t.key1part2=const OR t.key2part1=const"
(this is marked as DISCARD-IMERGE-3). This was justified because the left part of
the AND condition is not usable for range access, and the operation of
tree_and() guaranteed that there was no way it could be changed to make a
usable range plan. With the new tree_and() and rule A2, this is no longer the
case. For example, for this query:
(t.key1part2=const OR t.key2part1=const) AND t.key1part1=const
it will construct a
imerge(t.key1part2=const OR t.key2part1=const), range(t.key1part1=const)
then tree_and() will apply rule A2 to push the range down into index merge
and after that we'll have:
range(t.key1part1=const)
imerge(
t.key1part2=const AND t.key1part1=const,
t.key2part1=const
)
note that imerge(...) describes a usable index_merge plan and it's possible
that it will be the best access path.
O2. "Create index_merge accesses when possible"
Current tree_or() will not create index_merge access when it could create
non-index merge access (see DISCARD-IMERGE-2 and its example in the "Problems
in the current implementation" section). This will be changed to work as
follows: we will create an index_merge made from the index scans that didn't have
a match in the other sel_tree.
Illustrating it with an example:
| sel_tree_A | sel_tree_B | A or B | include in index_merge?
------+------------+------------+--------+------------------------
key1 | cond1 | cond2 | condM | no
key2 | cond3 | cond4 | NULL | no
key3 | cond5 | | | yes, A-side
key4 | cond6 | | | yes, A-side
key5 | | cond7 | | yes, B-side
key6 | | cond8 | | yes, B-side
here we assume that
- (cond1 OR cond2) did produce a combined range, so we do not include them in
the index_merge.
- (cond3 OR cond4) didn't produce a usable range (e.g. they were
t.key1part1=c1 and t.key1part2=c1, respectively, and combining them
didn't yield any range list)
- All other scans didn't have their counterparts, so we'll end up with a
SEL_TREE of:
range(condM) AND index_merge((cond5 AND cond6), (cond7 AND cond8)).
O4. There is no new action here: DISCARD-IMERGE-4 will remain in place. The idea is
that although DISCARD-IMERGE-4 does discard plans, so far we haven't
seen any complaints that could be attributed to it.
If we face the need to lift DISCARD-IMERGE-4, our answer will be to
lift it, and produce a cross-product:
((key1p OR key2p) AND (key3p OR key4p))
OR
((key5p OR key6p) AND (key7p OR key8p))
= (key1p OR key2p OR key5p OR key6p) AND // this part is currently
(key3p OR key4p OR key5p OR key6p) AND // produced
(key1p OR key2p OR key7p OR key8p) AND // this part will be added
(key3p OR key4p OR key7p OR key8p)
In order to limit the impact of this combinatorial explosion, we will
introduce a rule that we won't generate more than #defined
MAX_IMERGE_OPTS options.
3. Testing and required coverage
================================
So far we have found the following use cases (a test sketch follows the list):
* BUG#17259: Query optimizer chooses wrong index
* BUG#17673: Optimizer does not use Index Merge optimization in some cases
* BUG#23322: Optimizer sometimes erroniously prefers other index over index merge
* BUG#30151: optimizer is very reluctant to chose index_merge algorithm
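A minimal mysqltest-style sketch of the kind of coverage this needs (the data
distribution is described in comments only; this is not an actual test from
the tree):
  CREATE TABLE t1 (key1 INT, key2 INT, key3 INT, filler CHAR(200),
                   INDEX (key1), INDEX (key2), INDEX (key3));
  # populate so that key1=1 matches many rows while key2=2 and key3=3
  # each match only a few rows
  EXPLAIN SELECT * FROM t1 WHERE key1 = 1 AND (key2 = 2 OR key3 = 3);
  # expected after this WL: index_merge(union(key2,key3)) is considered
  # and chosen when it is cheaper than range(key1)
  DROP TABLE t1;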
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#67 Updated (by Psergey): ICP/MRR backport
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: ICP/MRR backport
CREATION DATE..: Thu, 26 Nov 2009, 15:19
SUPERVISOR.....: Monty
IMPLEMENTOR....: Psergey
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 67 (http://askmonty.org/worklog/?tid=67)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Tue, 29 Jun 2010, 13:57)=-=-
Status updated.
--- /tmp/wklog.67.old.31561 2010-06-29 13:57:50.000000000 +0000
+++ /tmp/wklog.67.new.31561 2010-06-29 13:57:50.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Complete
-=-=(Guest - Sun, 13 Jun 2010, 16:57)=-=-
Dependency deleted: 91 no longer depends on 67
-=-=(Igor - Wed, 10 Mar 2010, 19:14)=-=-
High Level Description modified.
--- /tmp/wklog.67.old.25641 2010-03-10 19:14:45.000000000 +0000
+++ /tmp/wklog.67.new.25641 2010-03-10 19:14:45.000000000 +0000
@@ -1,2 +1,2 @@
-Backport DS-MRR into MariaDB-5.2 codebase, also adding certain extra features to
-make it more usable.
+Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
+features to make it more usable.
-=-=(Guest - Wed, 10 Mar 2010, 19:12)=-=-
Title modified.
--- /tmp/wklog.67.old.25456 2010-03-10 19:12:57.000000000 +0000
+++ /tmp/wklog.67.new.25456 2010-03-10 19:12:57.000000000 +0000
@@ -1 +1 @@
-MRR backport
+ICP/MRR backport
-=-=(Psergey - Sun, 28 Feb 2010, 14:56)=-=-
Dependency created: 91 now depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:09)=-=-
Dependency created: 94 now depends on 67
-=-=(Psergey - Thu, 26 Nov 2009, 20:21)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.9329 2009-11-26 20:21:28.000000000 +0200
+++ /tmp/wklog.67.new.9329 2009-11-26 20:21:28.000000000 +0200
@@ -65,17 +65,19 @@
2.5 Make MRR code more of a module
----------------------------------
-Some code in handler.cc can be moved to separate file.
-But changes in opt_range.cc can't.
-TODO: Sort out how much we really can do here. Initial guess is not much as the
-code consists of:
+It is not possible to make MRR to be a totally separate module, as its code
+consists of :
- Default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
- calls. These rely on opt_range.cc's internal structures like SEL_ARG trees and
+ calls. These rely on opt_range.cc's internal stuctures like SEL_ARG trees and
so there is not much point in moving them out.
-- DS-MRR implementations which are spread over storage engines.
-and the only modularization we see is to move #1 into a separate file which
-won't achieve much.
+- DS-MRR impelementations which are spread over storage engines.
+
+We'll try to modularize what we can:
+- Move out default MRR implementation from handler.cc
+- Move possible parts out of opt_range.cc into a separate file.
+
+
2.6 Improve the cost model
--------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 19:06)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.6449 2009-11-26 19:06:04.000000000 +0200
+++ /tmp/wklog.67.new.6449 2009-11-26 19:06:04.000000000 +0200
@@ -1,4 +1,3 @@
-
<contents>
1. Requirements
2. Required actions
@@ -44,6 +43,7 @@
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
+http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
-=-=(Psergey - Thu, 26 Nov 2009, 18:15)=-=-
High-Level Specification modified.
--- /tmp/wklog.67.old.4161 2009-11-26 18:15:36.000000000 +0200
+++ /tmp/wklog.67.new.4161 2009-11-26 18:15:36.000000000 +0200
@@ -1,3 +1,17 @@
+
+<contents>
+1. Requirements
+2. Required actions
+2.1 Fix DS-MRR/InnoDB bugs
+2.2 Backport DS-MRR code to MariaDB 5.2
+2.3 Introduce control variables
+2.4 Other backport issues
+2.5 Make MRR code more of a module
+2.6 Improve the cost model
+2.7 Let DS-MRR support clustered primary keys
+</contents>
+
+
1. Requirements
===============
@@ -63,4 +77,28 @@
and the only modularization we see is to move #1 into a separate file which
won't achieve much.
+2.6 Improve the cost model
+--------------------------
+At the moment DS-MRR cost formula re-uses non-MRR scan costs, which uses
+records_in_range() calls, followed by index_only_read_time() or read_time()
+calls to produce the estimate for read cost.
+
+ We should change this (TODO sort out how exactly)
+
+Note: this means that the query plans will change from MariaDB 5.2.
+
+2.7 Let DS-MRR support clustered primary keys
+---------------------------------------------
+At the moment DS-MRR is not supported for clustered primary keys. It is not
+needed when MRR is used for range access, because range access is done over
+an ordered list of ranges, but it is useful for BKA.
+
+TODO:
+ it's useful for BKA because BKA makes MRR scans over un-orderered
+ non-disjoint lists of ranges. Then we can sort these and do ordered scans.
+ There is still no use for DS-MRR over clustered primary key for range
+ access, where the ranges are disjoint and ordered.
+ How about postponing this item until BKA is backported?
+
+
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=67&nolimit=1
DESCRIPTION:
Backport ICP and DS-MRR into MariaDB-5.2 codebase, also adding certain extra
features to make it more usable.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Requirements
2. Required actions
2.1 Fix DS-MRR/InnoDB bugs
2.2 Backport DS-MRR code to MariaDB 5.2
2.3 Introduce control variables
2.4 Other backport issues
2.5 Make MRR code more of a module
2.6 Improve the cost model
2.7 Let DS-MRR support clustered primary keys
</contents>
1. Requirements
===============
We need the following:
1. Latest MRR interface support, including extensions to support ICP when
using BKA.
2. Let DS-MRR support clustered primary keys (needed when using BKA).
3. Remove conditions used for key access from the condition pushed to index
(ATM this manifests itself as "Using index condition" appearing where there
was no "Using where". TODO: example of this?)
4. Introduce a separate @@optimizer_switch flag for turning ICP on/off (ATM it
is switched on/off by @@engine_condition_pushdown)
5. Introduce a separate @@mrr_buffer_size variable to control the MRR buffer size
for range+MRR scans. ATM it is controlled by the @@read_rnd_size setting, and that
makes it non-obvious for a number of users.
6. Rename multi_range_read_info_const() to look like it is not a part of the MRR
interface.
7. Improve MRR's cost model.
8. Try to make MRR more of a module.
2. Required actions
===================
Roughly in the order in which it will be done:
2.1 Fix DS-MRR/InnoDB bugs
--------------------------
We need to fix the bugs listed here:
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=index_condi…
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=mrr
http://bugs.mysql.com/search.php?cmd=display&status=Active&tags=icp
2.2 Backport DS-MRR code to MariaDB 5.2
---------------------------------------
The easiest way seems to be to manually move the needed code from mysql-6.0
(or whatever it's called now) to MariaDB.
2.3 Introduce control variables
-------------------------------
Act on items #4 and #5 from the requirements. Should be easy as
@@optimizer_switch is supported in 5.1 codebase.
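As an illustration only: the requirements name @@mrr_buffer_size explicitly,
but the exact ICP flag name is not decided here, so 'index_condition_pushdown'
below is an assumption:
  -- assumed flag name for ICP; @@mrr_buffer_size is from requirement #5
  SET SESSION optimizer_switch = 'index_condition_pushdown=on';
  SET SESSION mrr_buffer_size = 262144;  -- buffer for range+MRR scans
  SELECT @@optimizer_switch, @@mrr_buffer_size;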
2.4 Other backport issues
-------------------------
* Figure out what to do with NDB/MRR. The 5.1 codebase has the "old" NDB/MRR
implementation; mysql-6.0 (and NDB's branch) has the updated NDB/MRR,
but merging it into 5.1 can be very labor-intensive.
Will it be OK to disable NDB/MRR altogether?
2.5 Make MRR code more of a module
----------------------------------
It is not possible to make MRR a totally separate module, as its code
consists of:
- The default MRR implementation in handler.cc
- Changes in opt_range.cc to use MRR instead of multiple records_in_range()
calls. These rely on opt_range.cc's internal structures like SEL_ARG trees, and
so there is not much point in moving them out.
- DS-MRR implementations, which are spread over storage engines.
We'll try to modularize what we can:
- Move out default MRR implementation from handler.cc
- Move possible parts out of opt_range.cc into a separate file.
2.6 Improve the cost model
--------------------------
At the moment the DS-MRR cost formula re-uses non-MRR scan costs, which use
records_in_range() calls followed by index_only_read_time() or read_time()
calls to produce the estimate for read cost.
We should change this (TODO sort out how exactly)
Note: this means that the query plans will change from MariaDB 5.2.
2.7 Let DS-MRR support clustered primary keys
---------------------------------------------
At the moment DS-MRR is not supported for clustered primary keys. It is not
needed when MRR is used for range access, because range access is done over
an ordered list of ranges, but it is useful for BKA.
TODO:
it's useful for BKA because BKA makes MRR scans over un-ordered,
non-disjoint lists of ranges. Then we can sort these and do ordered scans.
There is still no use for DS-MRR over clustered primary key for range
access, where the ranges are disjoint and ordered.
How about postponing this item until BKA is backported?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#120 Updated (by Knielsen): Replication API for stacked event generators
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Tue, 29 Jun 2010, 13:51)=-=-
Status updated.
--- /tmp/wklog.120.old.31179 2010-06-29 13:51:20.000000000 +0000
+++ /tmp/wklog.120.new.31179 2010-06-29 13:51:20.000000000 +0000
@@ -1 +1 @@
-Assigned
+In-Progress
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and reading of events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
Examples of event generators are the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as a reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors, since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
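
To make this trade-off concrete, the two styles would look roughly as follows
on the consumer side. The function names are purely illustrative, and the
encapsulated variant uses a class from the draft further below:

/* Encapsulated style (proposed): the consumer only sees accessors. */
int on_query(const rpl_event_statement_query *ev);  /* ev->get_sql_mode() etc. */

/* Non-encapsulated alternative: a generic (type, THD) event. */
int on_event(int event_type, THD *thd);             /* thd->variables.sql_mode etc. */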
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
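
As an illustration, a consumer could use the writer-based materialise() variant
roughly as below. This is only a sketch: the writer name is invented, it assumes
the materialise() overloads end up public, and it leaves aside how the context
pointer would be passed through.

#include <stdio.h>

/* Hypothetical writer callback: dump the materialised bytes to stderr. */
static int dump_writer(uchar *data, size_t len, void *context)
{
  (void) context;                               /* unused in this sketch */
  return (fwrite(data, 1, len, stderr) == len) ? 0 : -1;
}

/* Usage, given some event 'ev' of type const rpl_event_base *:

     if (ev->materialise(dump_writer) != 0)
       ... handle materialisation error ...
*/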
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
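
For instance, a binlog-like consumer might record the global ID at commit time.
A minimal sketch, using only the accessors declared above (the output format is
an assumption):

#include <stdio.h>

/* Sketch: log the global transaction ID when a transaction commits. */
static int log_commit(const rpl_event_transaction_commit *ev)
{
  const rpl_event_transaction_commit::global_transaction_id *gtid=
    ev->get_global_transaction_id();
  if (gtid != NULL)
    fprintf(stderr, "commit %u-%llu\n",
            (unsigned) gtid->cluster_id, (unsigned long long) gtid->counter);
  return 0;
}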
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
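
To show how a consumer might pull statement context out of such an event, here
is a hedged sketch of a query handler. It assumes the flag_bits values are bit
positions (hence the shift), uses the usual str/length members of LEX_STRING,
and casts away const because the draft accessors are not yet declared const:

#include <stdio.h>

/* Sketch: consume a query event and honour the foreign_key_checks flag. */
static int handle_query(const rpl_event_statement_query *cev)
{
  rpl_event_statement_query *ev=
    const_cast<rpl_event_statement_query *>(cev);

  LEX_STRING query= ev->get_query_string();
  uint32_t flags= ev->get_flags();
  bool fk_checks=
    (flags & (1U << rpl_event_statement_query::STMT_FOREIGN_KEY_CHECKS)) != 0;

  fprintf(stderr, "query (sql_mode=%lu, fk_checks=%d): %.*s\n",
          ev->get_sql_mode(), (int) fk_checks,
          (int) query.length, query.str);
  return 0;
}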
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
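
As an example of the table iterator, a consumer could enumerate the tables
touched by a row event roughly like this; the access to TABLE::s->table_name is
an assumption about existing server internals, not something this API defines:

#include <stdio.h>

/* Sketch: list the tables modified by a row-based event. */
static void list_modified_tables(const rpl_event_row_base *ev)
{
  rpl_event_row_base::table_iterator it= ev->get_modified_tables();
  while (const TABLE *table= it.get_next())     /* returns NULL after the last */
    fprintf(stderr, "modified table: %s\n", table->s->table_name.str);
}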
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
consume_stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
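
To tie the pieces together, here is a hedged sketch of what a trivial consumer
built on this draft could look like. The class name, the choice of events and
the logging are all illustrative; error handling and de-registration (which the
API does not yet offer) are omitted:

#include <stdio.h>

/* Sketch: a consumer that traces commits and row writes. */
struct trace_consumer : public rpl_event_consumer_row
{
  virtual int trx_commit(const rpl_event_transaction_commit *)
  {
    fprintf(stderr, "transaction committed\n");
    return 0;
  }
  virtual int row_write(const rpl_event_row_write *)
  {
    fprintf(stderr, "row written\n");
    return 0;
  }
};

static trace_consumer trace_instance;

/* Called e.g. from a plugin init function. Registering with the row
   generator also delivers statement and transaction events through the
   inherited callbacks, since the generators are stacked. */
static int register_trace_consumer()
{
  return rpl_event_row_register(&trace_instance);
}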
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
[Maria-developers] WL#120 Updated (by Knielsen): Replication API for stacked event generators
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Tue, 29 Jun 2010, 13:51)=-=-
Status updated.
--- /tmp/wklog.120.old.31179 2010-06-29 13:51:20.000000000 +0000
+++ /tmp/wklog.120.new.31179 2010-06-29 13:51:20.000000000 +0000
@@ -1 +1 @@
-Assigned
+In-Progress
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and reading of events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
Examples of event generators are the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to the
appropriate context that allows all information for the event to be retrieved
using accessors. I.e. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as a reference to context (eg. THD) only. This however loses most
of the mentioned advantages of materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
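To illustrate the contrast (with invented names, not the proposed API): a
non-materialised event is just a thin accessor wrapper over live context and is
essentially free to create, while a materialised event pays for serialising
everything up front whether or not any consumer wants it:

/*
  Rough standalone illustration only; event_context, query_event_view and
  the byte-buffer format below are invented names, not the proposed API.
*/
#include <cstring>
#include <string>
#include <vector>

struct event_context            /* stands in for THD and friends */
{
  std::string query;
  unsigned long long trx_id;
};

/* Non-materialised: a reference plus accessors. A consumer pulls out only
   what it actually needs. */
class query_event_view
{
public:
  explicit query_event_view(const event_context *ctx) : ctx(ctx) { }
  const std::string &get_query() const { return ctx->query; }
  unsigned long long get_trx_id() const { return ctx->trx_id; }
private:
  const event_context *ctx;
};

/* Materialised: everything copied into a buffer up front, in some fixed
   format, whether or not any consumer will use it. */
static std::vector<unsigned char> materialise(const event_context &ctx)
{
  std::vector<unsigned char> buf(sizeof(ctx.trx_id) + ctx.query.size());
  std::memcpy(&buf[0], &ctx.trx_id, sizeof(ctx.trx_id));
  std::memcpy(&buf[sizeof(ctx.trx_id)], ctx.query.data(), ctx.query.size());
  return buf;
}

A consumer that only needs the transaction id calls one accessor and never pays
for copying the query text or any other context information.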
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
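As a small standalone illustration of the point about inlined accessors
(thd_context and rpl_event_stmt are invented stand-ins, not the real THD or the
proposed classes):

/*
  Standalone illustration of inlined accessors; thd_context and
  rpl_event_stmt are invented names, not the real THD or proposed classes.
*/
struct thd_context
{
  unsigned long long local_trx_id;
  const char *db;
};

class rpl_event_stmt
{
public:
  explicit rpl_event_stmt(const thd_context *thd) : thd(thd) { }
  /* Defined in the header, so the compiler can inline the call: it
     typically reduces to a single load of thd->local_trx_id. */
  unsigned long long get_local_trx_id() const { return thd->local_trx_id; }
  const char *get_current_db() const { return thd->db; }
private:
  const thd_context *thd;
};

Because the accessor bodies are visible to the compiler there is no call
overhead, but consumers end up compiled against the context layout, so a change
to that layout breaks ABI compatibility.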
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and an error
code otherwise. I still need to think more about whether this is useful
(ie. what the semantics of returning an error from a consumer callback are).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
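As a usage sketch of the draft above, not a definitive implementation: it
assumes the declarations exactly as sketched in this specification, while the
consumer name (my_binlog_writer) and its logging are invented for illustration.

/*
  Usage sketch only. Assumes the draft classes declared above
  (rpl_event_consumer_row, rpl_event_statement_query, rpl_event_row_write,
  rpl_event_row_register()); the consumer name and its logging are invented.
*/
class my_binlog_writer : public rpl_event_consumer_row
{
public:
  /* Only the callbacks we care about are overridden; all other events fall
     back to the default "do nothing, return success" implementations. */
  virtual int stmt_query(const rpl_event_statement_query *ev)
  {
    /* The draft declares get_query_string() non-const, hence the cast. */
    LEX_STRING q=
      const_cast<rpl_event_statement_query *>(ev)->get_query_string();
    /* ... append q.str / q.length to our own log here ... */
    (void) q;
    return 0;                                   /* 0 means success */
  }
  virtual int row_write(const rpl_event_row_write *ev)
  {
    const uchar *after= ev->get_after_image();
    /* ... append the after-image to our own log here ... */
    (void) after;
    return 0;
  }
};

/* Somewhere in plugin initialisation: */
static my_binlog_writer writer;

static int init_my_consumer()
{
  /*
    Register with the row generator; statement and transaction events arrive
    through the same object thanks to the consumer-interface inheritance.
  */
  return rpl_event_row_register(&writer);
}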
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#107 Updated (by Sergei): New replication APIs
by worklog-noreply@askmonty.org 29 Jun '10
by worklog-noreply@askmonty.org 29 Jun '10
29 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: In-Progress
PRIORITY.......: 60
WORKED HOURS...: 69
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sergei - Tue, 29 Jun 2010, 13:50)=-=-
Status updated.
--- /tmp/wklog.107.old.31164 2010-06-29 13:50:15.000000000 +0000
+++ /tmp/wklog.107.new.31164 2010-06-29 13:50:15.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+In-Progress
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Mon, 21 Jun 2010, 08:36)=-=-
Research and design thoughts.
-=-=(Knielsen - Mon, 07 Jun 2010, 12:11)=-=-
High Level Description modified.
--- /tmp/wklog.107.old.31097 2010-06-07 12:11:57.000000000 +0000
+++ /tmp/wklog.107.new.31097 2010-06-07 12:11:57.000000000 +0000
@@ -7,3 +7,6 @@
https://lists.launchpad.net/maria-developers/msg01998.html
+Wiki page for the project:
+
+ http://askmonty.org/wiki/ReplicationProject
-=-=(Knielsen - Mon, 29 Mar 2010, 07:33)=-=-
Research and design discussions: Galera, 2pc/XA, group commit, multi-engine transactions.
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Sergei - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
Wiki page for the project:
http://askmonty.org/wiki/ReplicationProject
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such ID will be defined
by the API, and to which extent it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (eg. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- Would be nice if the API provided facilities for implementing good
consistency checking support (mainly checking master tables against slave
tables is hard here I think, but also applying wrong binlog data and
individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server.
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not safe for statement-based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
This patch installs all the files that were missing from the installer
package. Now, the installer has the same set of files as the zip file.
Diff'ed against the current 5.1 tree.
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
2
2
[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (igor:2779)
by Igor Babaev 29 Jun '10
by Igor Babaev 29 Jun '10
29 Jun '10
#At lp:maria based on revid:knielsen@knielsen-hq.org-20091130132430-edrwle5zh6udx9rp
2779 Igor Babaev 2010-06-28
Optimization that checks for expressions whether they are always null.
modified:
mysql-test/r/func_in.result
sql/item.h
sql/item_cmpfunc.cc
sql/item_cmpfunc.h
sql/item_func.cc
sql/item_func.h
sql/item_sum.h
sql/sql_select.cc
sql/sql_udf.h
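To make the idea concrete, here is a simplified standalone sketch of the
always-null propagation this patch adds. The classes below are invented
stand-ins, not the real Item hierarchy: a null-preserving function with an
argument known to be NULL is itself always NULL, so the optimizer can replace
the whole condition (hence the "Impossible WHERE" changes in the test results).

/*
  Simplified stand-alone illustration of the always_null_cache idea;
  Expr, NullConst and NullPreservingFunc are invented names, not the
  real Item classes touched by this patch.
*/
#include <cstddef>
#include <cstdio>
#include <vector>

struct Expr
{
  virtual bool is_always_null() const { return false; }
  virtual ~Expr() { }
};

struct NullConst : public Expr
{
  /* Mirrors Item_null::is_always_null() returning 1. */
  virtual bool is_always_null() const { return true; }
};

/* A function that returns NULL whenever any argument is NULL (e.g. '<'). */
struct NullPreservingFunc : public Expr
{
  std::vector<Expr*> args;
  bool always_null_cache;

  explicit NullPreservingFunc(const std::vector<Expr*> &a) : args(a)
  {
    /* Mirrors the propagation added to Item_func::fix_fields(). */
    always_null_cache= false;
    for (std::size_t i= 0; i < args.size() && !always_null_cache; i++)
      always_null_cache= args[i]->is_always_null();
  }
  virtual bool is_always_null() const { return always_null_cache; }
};

int main()
{
  Expr col;              /* stands in for a column reference */
  NullConst null_lit;    /* stands in for the literal NULL   */
  std::vector<Expr*> a;
  a.push_back(&col);
  a.push_back(&null_lit);
  NullPreservingFunc cmp(a);   /* e.g. c_int IN (NULL) or c_int < NULL */
  /* The optimizer can now replace the whole condition with NULL. */
  std::printf("always null: %d\n", (int) cmp.is_always_null());
  return 0;
}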
=== modified file 'mysql-test/r/func_in.result'
--- a/mysql-test/r/func_in.result 2009-10-05 05:27:36 +0000
+++ b/mysql-test/r/func_in.result 2010-06-29 00:24:26 +0000
@@ -642,10 +642,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_int c_int 4 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_int IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_int IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_decimal IN (1, 2, 3);
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 range c_decimal c_decimal 3 NULL 3 Using where
@@ -654,10 +654,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_decimal c_decimal 3 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_decimal IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_decimal IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_float IN (1, 2, 3);
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 range c_float c_float 4 NULL 3 Using where
@@ -666,10 +666,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_float c_float 4 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_float IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_float IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_bit IN (1, 2, 3);
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 range c_bit c_bit 2 NULL 3 Using where
@@ -678,10 +678,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_bit c_bit 2 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_bit IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_bit IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_date
IN ('2009-09-01', '2009-09-02', '2009-09-03');
id select_type table type possible_keys key key_len ref rows Extra
@@ -692,10 +692,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_date c_date 3 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_date IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_date IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_datetime
IN ('2009-09-01 00:00:01', '2009-09-02 00:00:01', '2009-09-03 00:00:01');
id select_type table type possible_keys key key_len ref rows Extra
@@ -706,10 +706,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_datetime c_datetime 8 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_datetime IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_datetime IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_timestamp
IN ('2009-09-01 00:00:01', '2009-09-01 00:00:02', '2009-09-01 00:00:03');
id select_type table type possible_keys key key_len ref rows Extra
@@ -720,10 +720,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_timestamp c_timestamp 4 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_timestamp IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_timestamp IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_year IN (1, 2, 3);
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 range c_year c_year 1 NULL 3 Using where
@@ -732,10 +732,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_year c_year 1 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_year IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_year IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_char IN ('1', '2', '3');
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 range c_char c_char 10 NULL 3 Using where
@@ -744,10 +744,10 @@ id select_type table type possible_keys
1 SIMPLE t1 range c_char c_char 10 NULL 3 Using where
EXPLAIN SELECT * FROM t1 WHERE c_char IN (NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
EXPLAIN SELECT * FROM t1 WHERE c_char IN (NULL, NULL);
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE noticed after reading const tables
+1 SIMPLE NULL NULL NULL NULL NULL NULL NULL Impossible WHERE
DROP TABLE t1;
#
End of 5.1 tests
=== modified file 'sql/item.h'
--- a/sql/item.h 2009-11-16 20:49:51 +0000
+++ b/sql/item.h 2010-06-29 00:24:26 +0000
@@ -774,6 +774,7 @@ public:
will not change until next fix_fields) and its value is known.
*/
virtual bool const_item() const { return used_tables() == 0; }
+ virtual bool is_always_null() const { return 0; }
/*
Returns true if this is constant but its value may be not known yet.
(Can be used for parameters of prep. stmts or of stored procedures.)
@@ -1563,6 +1564,7 @@ public:
enum Item_result result_type () const { return STRING_RESULT; }
enum_field_types field_type() const { return MYSQL_TYPE_NULL; }
bool basic_const_item() const { return 1; }
+ bool is_always_null() const { return 1; }
Item *clone_item() { return new Item_null(name); }
bool is_null() { return 1; }
=== modified file 'sql/item_cmpfunc.cc'
--- a/sql/item_cmpfunc.cc 2009-11-16 20:49:51 +0000
+++ b/sql/item_cmpfunc.cc 2010-06-29 00:24:26 +0000
@@ -1983,6 +1983,19 @@ bool Item_func_between::fix_fields(THD *
(args[1]->not_null_tables() &
args[2]->not_null_tables()));
+ if (negated)
+ {
+ always_null_cache= 1;
+ if (!args[0]->is_always_null())
+ always_null_cache= args[1]->is_always_null() &&
+ args[2]->is_always_null();
+ }
+ else
+ {
+ always_null_cache= args[0]->is_always_null() ||
+ args[1]->is_always_null() ||
+ args[2]->is_always_null();
+ }
return 0;
}
@@ -3545,6 +3558,23 @@ Item_func_in::fix_fields(THD *thd, Item
for (arg= args + 1, arg_end= args + arg_count; arg != arg_end; arg++)
not_null_tables_cache&= (*arg)->not_null_tables();
not_null_tables_cache|= (*args)->not_null_tables();
+ if (negated)
+ {
+ always_null_cache= 0;
+ for (arg= args, arg_end= args + arg_count;
+ !always_null_cache && arg != arg_end; arg++)
+ always_null_cache= (*arg)->is_always_null();
+ }
+ else
+ {
+ always_null_cache= 1;
+ if (!(*args)->is_always_null())
+ {
+ for (arg= args + 1, arg_end= args + arg_count;
+ always_null_cache && arg != arg_end; arg++)
+ always_null_cache= (*arg)->is_always_null();
+ }
+ }
return 0;
}
=== modified file 'sql/item_cmpfunc.h'
--- a/sql/item_cmpfunc.h 2009-11-16 20:49:51 +0000
+++ b/sql/item_cmpfunc.h 2010-06-29 00:24:26 +0000
@@ -239,6 +239,7 @@ public:
longlong val_int();
void cleanup();
const char *func_name() const { return "<in_optimizer>"; }
+ bool is_null_preserving() const { return 0; }
Item_cache **get_cache() { return &cache; }
void keep_top_level_cache();
};
@@ -438,6 +439,7 @@ public:
longlong val_int();
enum Functype functype() const { return NOT_ALL_FUNC; }
const char *func_name() const { return "<not>"; }
+ bool is_null_preserving() const { return 0; }
virtual void print(String *str, enum_query_type query_type);
void set_sum_test(Item_sum_hybrid *item) { test_sum_item= item; };
void set_sub_test(Item_maxmin_subselect *item) { test_sub_item= item; };
@@ -453,6 +455,7 @@ public:
Item_func_nop_all(Item *a) :Item_func_not_all(a) {}
longlong val_int();
const char *func_name() const { return "<nop>"; }
+ bool is_null_preserving() const { return 0; }
Item *neg_transformer(THD *thd);
};
@@ -480,6 +483,7 @@ public:
enum Functype rev_functype() const { return EQUAL_FUNC; }
cond_result eq_cmp_result() const { return COND_TRUE; }
const char *func_name() const { return "<=>"; }
+ bool is_null_preserving() const { return 0; }
Item *neg_transformer(THD *thd) { return 0; }
};
@@ -597,6 +601,7 @@ public:
optimize_type select_optimize() const { return OPTIMIZE_KEY; }
enum Functype functype() const { return BETWEEN; }
const char *func_name() const { return "between"; }
+ bool is_null_preserving() const { return !negated; }
bool fix_fields(THD *, Item **);
void fix_length_and_dec();
virtual void print(String *str, enum_query_type query_type);
@@ -663,6 +668,7 @@ public:
const char *func_name() const { return "coalesce"; }
table_map not_null_tables() const { return 0; }
enum_field_types field_type() const { return cached_field_type; }
+ bool is_null_preserving() const { return 0; }
};
@@ -720,6 +726,7 @@ public:
void fix_length_and_dec();
uint decimal_precision() const { return args[0]->decimal_precision(); }
const char *func_name() const { return "nullif"; }
+ bool is_null_preserving() const { return 0; }
virtual inline void print(String *str, enum_query_type query_type)
{
@@ -1152,6 +1159,7 @@ public:
void fix_length_and_dec();
uint decimal_precision() const;
table_map not_null_tables() const { return 0; }
+ bool is_null_preserving() const { return 0; }
enum Item_result result_type () const { return cached_result_type; }
enum_field_types field_type() const { return cached_field_type; }
const char *func_name() const { return "case"; }
@@ -1225,6 +1233,7 @@ public:
virtual void print(String *str, enum_query_type query_type);
enum Functype functype() const { return IN_FUNC; }
const char *func_name() const { return " IN "; }
+ bool is_null_preserving() { return arg_count == 2 || negated; }
bool nulls_in_row();
bool is_bool_func() { return 1; }
CHARSET_INFO *compare_collation() { return cmp_collation.collation; }
@@ -1275,6 +1284,7 @@ public:
update_used_tables();
}
const char *func_name() const { return "isnull"; }
+ bool is_null_preserving() const { return 0; }
/* Optimize case of not_null_column IS NULL */
virtual void update_used_tables()
{
@@ -1340,6 +1350,7 @@ public:
decimals=0; max_length=1; maybe_null=0;
}
const char *func_name() const { return "isnotnull"; }
+ bool is_null_preserving() const { return 0; }
optimize_type select_optimize() const { return OPTIMIZE_NULL; }
table_map not_null_tables() const
{ return abort_on_null ? not_null_tables_cache : 0; }
@@ -1465,6 +1476,7 @@ public:
enum Type type() const { return COND_ITEM; }
List<Item>* argument_list() { return &list; }
+ bool is_null_preserving() const { return 0; }
table_map used_tables() const;
void update_used_tables();
virtual void print(String *str, enum_query_type query_type);
=== modified file 'sql/item_func.cc'
--- a/sql/item_func.cc 2009-11-16 20:49:51 +0000
+++ b/sql/item_func.cc 2010-06-29 00:24:26 +0000
@@ -156,6 +156,8 @@ Item_func::fix_fields(THD *thd, Item **r
used_tables_cache= not_null_tables_cache= 0;
const_item_cache=1;
+ always_null_cache= 0;
+ bool maybe_always_null= is_null_preserving();
if (check_stack_overrun(thd, STACK_MIN_SIZE, buff))
return TRUE; // Fatal error if flag is set!
@@ -193,6 +195,8 @@ Item_func::fix_fields(THD *thd, Item **r
not_null_tables_cache|= item->not_null_tables();
const_item_cache&= item->const_item();
with_subselect|= item->with_subselect;
+ if (maybe_always_null && !always_null_cache)
+ always_null_cache= item->is_always_null();
}
}
fix_length_and_dec();
@@ -202,7 +206,7 @@ Item_func::fix_fields(THD *thd, Item **r
return FALSE;
}
-
+
bool Item_func::walk(Item_processor processor, bool walk_subquery,
uchar *argument)
{
@@ -2863,6 +2867,7 @@ udf_handler::fix_fields(THD *thd, Item_r
func->maybe_null=0;
used_tables_cache=0;
const_item_cache=1;
+ always_null_cache= 0;
if ((f_args.arg_count=arg_count))
{
=== modified file 'sql/item_func.h'
--- a/sql/item_func.h 2009-11-16 20:49:51 +0000
+++ b/sql/item_func.h 2010-06-29 00:24:26 +0000
@@ -40,6 +40,7 @@ public:
uint arg_count;
table_map used_tables_cache, not_null_tables_cache;
bool const_item_cache;
+ bool always_null_cache;
enum Functype { UNKNOWN_FUNC,EQ_FUNC,EQUAL_FUNC,NE_FUNC,LT_FUNC,LE_FUNC,
GE_FUNC,GT_FUNC,FT_FUNC,
LIKE_FUNC,ISNULL_FUNC,ISNOTNULL_FUNC,
@@ -135,7 +136,9 @@ public:
instead.
*/
virtual const char *func_name() const= 0;
+ virtual bool is_null_preserving() const { return TRUE; }
virtual bool const_item() const { return const_item_cache; }
+ bool is_always_null() const { return always_null_cache; }
inline Item **arguments() const { return args; }
void set_arguments(List<Item> &list);
inline uint argument_count() const { return arg_count; }
@@ -983,6 +986,7 @@ public:
Item_func_last_insert_id(Item *a) :Item_int_func(a) {}
longlong val_int();
const char *func_name() const { return "last_insert_id"; }
+ bool is_null_preserving() const { return 0; }
void fix_length_and_dec()
{
if (arg_count)
@@ -1034,6 +1038,7 @@ public:
Item_udf_func(udf_func *udf_arg, List<Item> &list)
:Item_func(list), udf(udf_arg) {}
const char *func_name() const { return udf.name(); }
+ bool is_null_preserving() const { return 0; }
enum Functype functype() const { return UDF_FUNC; }
bool fix_fields(THD *thd, Item **ref)
{
@@ -1041,6 +1046,7 @@ public:
bool res= udf.fix_fields(thd, this, arg_count, args);
used_tables_cache= udf.used_tables_cache;
const_item_cache= udf.const_item_cache;
+ always_null_cache= udf.always_null_cache;
fixed= 1;
return res;
}
@@ -1352,6 +1358,7 @@ public:
virtual void print(String *str, enum_query_type query_type);
void print_as_stmt(String *str, enum_query_type query_type);
const char *func_name() const { return "set_user_var"; }
+ bool is_null_preserving() const { return 0; }
int save_in_field(Field *field, bool no_conversions,
bool can_use_result_field);
int save_in_field(Field *field, bool no_conversions)
@@ -1468,6 +1475,7 @@ public:
String* val_str(String*);
/* TODO: fix to support views */
const char *func_name() const { return "get_system_var"; }
+ bool is_null_preserving() const { return 0; }
/**
Indicates whether this system variable is written to the binlog or not.
@@ -1628,6 +1636,8 @@ public:
const char *func_name() const;
+ bool is_null_preserving() const { return 0; }
+
enum enum_field_types field_type() const;
Field *tmp_table_field(TABLE *t_arg);
=== modified file 'sql/item_sum.h'
--- a/sql/item_sum.h 2009-09-15 10:46:35 +0000
+++ b/sql/item_sum.h 2010-06-29 00:24:26 +0000
@@ -287,6 +287,7 @@ public:
Item_sum(THD *thd, Item_sum *item);
enum Type type() const { return SUM_FUNC_ITEM; }
virtual enum Sumfunctype sum_func () const=0;
+ bool is_null_preserving() const { return 0; }
/*
This method is similar to add(), but it is called when the current
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-11-27 13:20:59 +0000
+++ b/sql/sql_select.cc 2010-06-29 00:24:26 +0000
@@ -7886,6 +7886,10 @@ static COND *build_equal_items_for_cond(
}
else if (cond->type() == Item::FUNC_ITEM)
{
+ Item *new_item;
+ if ((Item_func *)cond->is_always_null() && (new_item= new Item_null()))
+ return new_item;
+
List<Item> eq_list;
/*
If an equality predicate forms the whole and level,
=== modified file 'sql/sql_udf.h'
--- a/sql/sql_udf.h 2007-07-06 12:18:49 +0000
+++ b/sql/sql_udf.h 2010-06-29 00:24:26 +0000
@@ -63,6 +63,7 @@ class udf_handler :public Sql_alloc
public:
table_map used_tables_cache;
bool const_item_cache;
+ bool always_null_cache;
bool not_original;
udf_handler(udf_func *udf_arg) :u_d(udf_arg), buffers(0), error(0),
is_null(0), initialized(0), not_original(0)
1
0
[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (igor:2747)
by Igor Babaev 29 Jun '10
by Igor Babaev 29 Jun '10
29 Jun '10
#At lp:maria based on revid:monty@askmonty.org-20091014080956-d6xr2v3glk4v53sg
2747 Igor Babaev 2010-06-28
Partitioned key cache (mwl#85) for maria-5.1. Saved for possible
future request.
modified:
include/keycache.h
mysql-test/r/information_schema.result
mysql-test/r/information_schema_all_engines.result
mysql-test/r/key_cache.result
mysql-test/t/key_cache.test
mysys/mf_keycache.c
sql/handler.cc
sql/handler.h
sql/mysqld.cc
sql/set_var.cc
sql/set_var.h
sql/sql_show.cc
sql/sql_test.cc
sql/table.h
storage/myisam/mi_check.c
storage/myisam/mi_close.c
storage/myisam/mi_delete_all.c
storage/myisam/mi_extra.c
storage/myisam/mi_keycache.c
storage/myisam/mi_locking.c
storage/myisam/mi_page.c
storage/myisam/mi_panic.c
storage/myisam/mi_preload.c
storage/myisam/mi_test1.c
storage/myisam/mi_test2.c
storage/myisam/mi_test3.c
storage/myisam/myisam_ftdump.c
storage/myisam/myisamchk.c
storage/myisam/myisamdef.h
storage/myisam/myisamlog.c
=== modified file 'include/keycache.h'
--- a/include/keycache.h 2007-12-16 15:03:44 +0000
+++ b/include/keycache.h 2010-06-29 00:10:53 +0000
@@ -19,96 +19,121 @@
#define _keycache_h
C_MODE_START
-/* declare structures that is used by st_key_cache */
-struct st_block_link;
-typedef struct st_block_link BLOCK_LINK;
-struct st_keycache_page;
-typedef struct st_keycache_page KEYCACHE_PAGE;
-struct st_hash_link;
-typedef struct st_hash_link HASH_LINK;
-/* info about requests in a waiting queue */
-typedef struct st_keycache_wqueue
+/*
+ Currently the default key cache is created as non-partitioned at
+ the start of the server unless the server is started with the parameter
+ --key-cache-partitions that is greater than 0
+*/
+
+#define DEFAULT_KEY_CACHE_PARTITIONS 0
+
+/*
+ MAX_KEY_CACHE_PARTITIONS cannot be greater than
+ sizeof(MYISAM_SHARE::dirty_part_map)
+ Currently sizeof(MYISAM_SHARE::dirty_part_map)=sizeof(ulonglong)
+*/
+
+#define MAX_KEY_CACHE_PARTITIONS 64
+
+
+/* The structure to get statistical data about a key cache */
+
+typedef struct st_key_cache_statistics
+{
+ ulonglong mem_size; /* memory for cache buffers/auxiliary structures */
+ ulonglong block_size; /* size of the each buffers in the key cache */
+ ulonglong blocks_used; /* maximum number of used blocks/buffers */
+ ulonglong blocks_unused; /* number of currently unused blocks */
+ ulonglong blocks_changed; /* number of currently dirty blocks */
+ ulonglong read_requests; /* number of read requests (read hits) */
+ ulonglong reads; /* number of actual reads from files into buffers */
+ ulonglong write_requests; /* number of write requests (write hits) */
+ ulonglong writes; /* number of actual writes from buffers into files */
+} KEY_CACHE_STATISTICS;
+
+/* The type of a key cache object */
+typedef enum key_cache_type
{
- struct st_my_thread_var *last_thread; /* circular list of waiting threads */
-} KEYCACHE_WQUEUE;
+ SIMPLE_KEY_CACHE,
+ PARTITIONED_KEY_CACHE
+} KEY_CACHE_TYPE;
-#define CHANGED_BLOCKS_HASH 128 /* must be power of 2 */
/*
- The key cache structure
- It also contains read-only statistics parameters.
+ An object of the type KEY_CACHE_FUNCS contains pointers to all functions
+ from the key cache interface.
+ Currently a key cache can be of two types: simple and partitioned.
+ For each of them its own static structure of the type KEY_CACHE_FUNCS is
+ defined . The structures contain the pointers to the implementations of
+ the interface functions used by simple key caches and partitioned key
+ caches respectively. Pointers to these structures are assigned to key cache
+ objects at the time of their creation.
*/
+typedef struct st_key_cache_funcs
+{
+ int (*init) (void *, uint key_cache_block_size,
+ size_t use_mem, uint division_limit, uint age_threshold);
+ int (*resize) (void *, uint key_cache_block_size,
+ size_t use_mem, uint division_limit, uint age_threshold);
+ void (*change_param) (void *keycache_cb,
+ uint division_limit, uint age_threshold);
+ uchar* (*read) (void *keycache_cb,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length, int return_buffer);
+ int (*insert) (void *keycache_cb,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length);
+ int (*write) (void *keycache_cb,
+ File file, void *file_extra,
+ my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length, int force_write);
+ int (*flush) (void *keycache_cb,
+ int file, void *file_extra,
+ enum flush_type type);
+ int (*reset_counters) (const char *name, void *keycache_cb);
+ void (*end) (void *keycache_cb, my_bool cleanup);
+ void (*get_stats) (void *keycache_cb, uint partition_no,
+ KEY_CACHE_STATISTICS *key_cache_stats);
+ ulonglong (*get_stat_val) (void *keycache_cb, uint var_no);
+} KEY_CACHE_FUNCS;
+
+
typedef struct st_key_cache
{
- my_bool key_cache_inited;
- my_bool in_resize; /* true during resize operation */
- my_bool resize_in_flush; /* true during flush of resize operation */
+ KEY_CACHE_TYPE key_cache_type; /* type of the key cache used for debugging */
+ void *keycache_cb; /* control block of the used key cache */
+ KEY_CACHE_FUNCS *interface_funcs; /* interface functions of the key cache */
+ ulonglong param_buff_size; /* size the memory allocated for the cache */
+ ulong param_block_size; /* size of the blocks in the key cache */
+ ulong param_division_limit; /* min. percentage of warm blocks */
+ ulong param_age_threshold; /* determines when hot block is downgraded */
+ ulong param_partitions; /* number of the key cache partitions */
+ my_bool key_cache_inited; /* <=> key cache has been created */
my_bool can_be_used; /* usage of cache for read/write is allowed */
- size_t key_cache_mem_size; /* specified size of the cache memory */
- uint key_cache_block_size; /* size of the page buffer of a cache block */
- ulong min_warm_blocks; /* min number of warm blocks; */
- ulong age_threshold; /* age threshold for hot blocks */
- ulonglong keycache_time; /* total number of block link operations */
- uint hash_entries; /* max number of entries in the hash table */
- int hash_links; /* max number of hash links */
- int hash_links_used; /* number of hash links currently used */
- int disk_blocks; /* max number of blocks in the cache */
- ulong blocks_used; /* maximum number of concurrently used blocks */
- ulong blocks_unused; /* number of currently unused blocks */
- ulong blocks_changed; /* number of currently dirty blocks */
- ulong warm_blocks; /* number of blocks in warm sub-chain */
- ulong cnt_for_resize_op; /* counter to block resize operation */
- long blocks_available; /* number of blocks available in the LRU chain */
- HASH_LINK **hash_root; /* arr. of entries into hash table buckets */
- HASH_LINK *hash_link_root; /* memory for hash table links */
- HASH_LINK *free_hash_list; /* list of free hash links */
- BLOCK_LINK *free_block_list; /* list of free blocks */
- BLOCK_LINK *block_root; /* memory for block links */
- uchar HUGE_PTR *block_mem; /* memory for block buffers */
- BLOCK_LINK *used_last; /* ptr to the last block of the LRU chain */
- BLOCK_LINK *used_ins; /* ptr to the insertion block in LRU chain */
- pthread_mutex_t cache_lock; /* to lock access to the cache structure */
- KEYCACHE_WQUEUE resize_queue; /* threads waiting during resize operation */
- /*
- Waiting for a zero resize count. Using a queue for symmetry though
- only one thread can wait here.
- */
- KEYCACHE_WQUEUE waiting_for_resize_cnt;
- KEYCACHE_WQUEUE waiting_for_hash_link; /* waiting for a free hash link */
- KEYCACHE_WQUEUE waiting_for_block; /* requests waiting for a free block */
- BLOCK_LINK *changed_blocks[CHANGED_BLOCKS_HASH]; /* hash for dirty file bl.*/
- BLOCK_LINK *file_blocks[CHANGED_BLOCKS_HASH]; /* hash for other file bl.*/
-
- /*
- The following variables are and variables used to hold parameters for
- initializing the key cache.
- */
-
- ulonglong param_buff_size; /* size the memory allocated for the cache */
- ulong param_block_size; /* size of the blocks in the key cache */
- ulong param_division_limit; /* min. percentage of warm blocks */
- ulong param_age_threshold; /* determines when hot block is downgraded */
-
- /* Statistics variables. These are reset in reset_key_cache_counters(). */
- ulong global_blocks_changed; /* number of currently dirty blocks */
+ my_bool in_init; /* Set to 1 in MySQL during init/resize */
+ uint partitions; /* actual number of partitions */
+ size_t key_cache_mem_size; /* specified size of the cache memory */
+ ulong blocks_used; /* maximum number of concurrently used blocks */
+ ulong blocks_unused; /* number of currently unused blocks */
+ ulong global_blocks_changed; /* number of currently dirty blocks */
ulonglong global_cache_w_requests;/* number of write requests (write hits) */
ulonglong global_cache_write; /* number of writes from cache to files */
ulonglong global_cache_r_requests;/* number of read requests (read hits) */
ulonglong global_cache_read; /* number of reads from files to cache */
-
- int blocks; /* max number of blocks in the cache */
- my_bool in_init; /* Set to 1 in MySQL during init/resize */
} KEY_CACHE;
+
/* The default key cache */
extern KEY_CACHE dflt_key_cache_var, *dflt_key_cache;
extern int init_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
size_t use_mem, uint division_limit,
- uint age_threshold);
+ uint age_threshold, uint partitions);
extern int resize_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
size_t use_mem, uint division_limit,
uint age_threshold);
@@ -122,12 +147,18 @@ extern int key_cache_insert(KEY_CACHE *k
File file, my_off_t filepos, int level,
uchar *buff, uint length);
extern int key_cache_write(KEY_CACHE *keycache,
- File file, my_off_t filepos, int level,
+ File file, void *file_extra,
+ my_off_t filepos, int level,
uchar *buff, uint length,
- uint block_length,int force_write);
+ uint block_length, int force_write);
extern int flush_key_blocks(KEY_CACHE *keycache,
- int file, enum flush_type type);
+ int file, void *file_extra,
+ enum flush_type type);
extern void end_key_cache(KEY_CACHE *keycache, my_bool cleanup);
+extern void get_key_cache_statistics(KEY_CACHE *keycache,
+ uint partition_no,
+ KEY_CACHE_STATISTICS *key_cache_stats);
+extern ulonglong get_key_cache_stat_value(KEY_CACHE *keycache, uint var_no);
/* Functions to handle multiple key caches */
extern my_bool multi_keycache_init(void);
@@ -140,5 +171,11 @@ extern void multi_key_cache_change(KEY_C
KEY_CACHE *new_data);
extern int reset_key_cache_counters(const char *name,
KEY_CACHE *key_cache);
+extern int repartition_key_cache(KEY_CACHE *keycache,
+ uint key_cache_block_size,
+ size_t use_mem,
+ uint division_limit,
+ uint age_threshold,
+ uint partitions);
C_MODE_END
#endif /* _keycache_h */
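
Note: the new partitions argument of init_key_cache() and the new
repartition_key_cache() entry point declared above are driven from SQL
through the key_cache_partitions variable (exercised in the key_cache
test further down in this patch). A minimal client-side sketch,
assuming a server built with this patch:

  SET GLOBAL key_cache_partitions = 2;            # repartition the default key cache
  SET GLOBAL keycache1.key_buffer_size = 256*1024;
  SET GLOBAL keycache1.key_cache_partitions = 7;  # named caches can be partitioned too
  SELECT * FROM information_schema.KEY_CACHES;    # one row per partition plus a totals row
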
=== modified file 'mysql-test/r/information_schema.result'
--- a/mysql-test/r/information_schema.result 2009-09-29 20:19:43 +0000
+++ b/mysql-test/r/information_schema.result 2010-06-29 00:10:53 +0000
@@ -67,6 +67,7 @@ INNODB_LOCK_WAITS
INNODB_RSEG
INNODB_TABLE_STATS
INNODB_TRX
+KEY_CACHES
KEY_COLUMN_USAGE
PARTITIONS
PLUGINS
=== modified file 'mysql-test/r/information_schema_all_engines.result'
--- a/mysql-test/r/information_schema_all_engines.result 2009-08-03 20:09:53 +0000
+++ b/mysql-test/r/information_schema_all_engines.result 2010-06-29 00:10:53 +0000
@@ -11,6 +11,7 @@ EVENTS
FILES
GLOBAL_STATUS
GLOBAL_VARIABLES
+KEY_CACHES
KEY_COLUMN_USAGE
PARTITIONS
PLUGINS
@@ -69,6 +70,7 @@ EVENTS EVENT_SCHEMA
FILES TABLE_SCHEMA
GLOBAL_STATUS VARIABLE_NAME
GLOBAL_VARIABLES VARIABLE_NAME
+KEY_CACHES KEY_CACHE_NAME
KEY_COLUMN_USAGE CONSTRAINT_SCHEMA
PARTITIONS TABLE_SCHEMA
PLUGINS PLUGIN_NAME
@@ -127,6 +129,7 @@ EVENTS EVENT_SCHEMA
FILES TABLE_SCHEMA
GLOBAL_STATUS VARIABLE_NAME
GLOBAL_VARIABLES VARIABLE_NAME
+KEY_CACHES KEY_CACHE_NAME
KEY_COLUMN_USAGE CONSTRAINT_SCHEMA
PARTITIONS TABLE_SCHEMA
PLUGINS PLUGIN_NAME
@@ -204,6 +207,7 @@ INNODB_LOCK_WAITS information_schema.INN
INNODB_RSEG information_schema.INNODB_RSEG 1
INNODB_TABLE_STATS information_schema.INNODB_TABLE_STATS 1
INNODB_TRX information_schema.INNODB_TRX 1
+KEY_CACHES information_schema.KEY_CACHES 1
KEY_COLUMN_USAGE information_schema.KEY_COLUMN_USAGE 1
PARTITIONS information_schema.PARTITIONS 1
PBXT_STATISTICS information_schema.PBXT_STATISTICS 1
@@ -238,6 +242,7 @@ Database: information_schema
| FILES |
| GLOBAL_STATUS |
| GLOBAL_VARIABLES |
+| KEY_CACHES |
| KEY_COLUMN_USAGE |
| PARTITIONS |
| PLUGINS |
@@ -286,6 +291,7 @@ Database: INFORMATION_SCHEMA
| FILES |
| GLOBAL_STATUS |
| GLOBAL_VARIABLES |
+| KEY_CACHES |
| KEY_COLUMN_USAGE |
| PARTITIONS |
| PLUGINS |
@@ -328,5 +334,5 @@ Wildcard: inf_rmation_schema
+--------------------+
SELECT table_schema, count(*) FROM information_schema.TABLES WHERE table_schema IN ('mysql', 'INFORMATION_SCHEMA', 'test', 'mysqltest') AND table_name<>'ndb_binlog_index' AND table_name<>'ndb_apply_status' GROUP BY TABLE_SCHEMA;
table_schema count(*)
-information_schema 43
+information_schema 44
mysql 22
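
The KEY_CACHES table that now appears in the information_schema
listings above is the observation point used throughout the key_cache
test below. Its columns are visible in the expected output that
follows (KEY_CACHE_NAME, PARTITIONS, PARTITION_NUMBER, FULL_SIZE,
BLOCK_SIZE, USED_BLOCKS, UNUSED_BLOCKS, DIRTY_BLOCKS, READ_REQUESTS,
READS, WRITE_REQUESTS, WRITES). For a partitioned cache there is one
row per partition and a row with PARTITION_NUMBER = NULL holding the
totals; a non-partitioned cache shows NULL in both PARTITIONS and
PARTITION_NUMBER. For example:

  SELECT KEY_CACHE_NAME, PARTITION_NUMBER, FULL_SIZE, READS, WRITES
  FROM information_schema.KEY_CACHES
  WHERE KEY_CACHE_NAME = 'default';
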
=== modified file 'mysql-test/r/key_cache.result'
--- a/mysql-test/r/key_cache.result 2009-03-16 19:54:50 +0000
+++ b/mysql-test/r/key_cache.result 2010-06-29 00:10:53 +0000
@@ -1,5 +1,7 @@
drop table if exists t1, t2, t3;
-SET @save_key_buffer=@@key_buffer_size;
+SET @save_key_buffer_size=@@key_buffer_size;
+SET @save_key_cache_block_size=@@key_cache_block_size;
+SET @save_key_cache_partitions=@@key_cache_partitions;
SELECT @@key_buffer_size, @@small.key_buffer_size;
@@key_buffer_size @@small.key_buffer_size
2097152 131072
@@ -37,7 +39,7 @@ SELECT @@small.key_buffer_size;
SELECT @@medium.key_buffer_size;
@@medium.key_buffer_size
0
-SET @@global.key_buffer_size=@save_key_buffer;
+SET @@global.key_buffer_size=@save_key_buffer_size;
SELECT @@default.key_buffer_size;
ERROR 42000: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'default.key_buffer_size' at line 1
SELECT @@skr.storage_engine="test";
@@ -366,3 +368,537 @@ Variable_name Value
key_cache_block_size 1536
SET GLOBAL key_cache_block_size= @bug28478_key_cache_block_size;
DROP TABLE t1;
+set global key_buffer_size=@save_key_buffer_size;
+set global key_cache_block_size=@save_key_cache_block_size;
+select @@key_buffer_size;
+@@key_buffer_size
+2097152
+select @@key_cache_block_size;
+@@key_cache_block_size
+1024
+select @@key_cache_partitions;
+@@key_cache_partitions
+0
+create table t1 (
+p int not null auto_increment primary key,
+a char(10));
+create table t2 (
+p int not null auto_increment primary key,
+i int, a char(10), key k1(i), key k2(a));
+select @@key_cache_partitions;
+@@key_cache_partitions
+0
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default NULL NULL 2097152 1024 0 # 0 0 0 0 0
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+(3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+p a
+1 qqqq
+2 yyyy
+select * from t2;
+p i a
+1 1 qqqq
+2 1 pppp
+3 1 yyyy
+4 3 zzzz
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+show status like 'key_%';
+Variable_name Value
+Key_blocks_not_flushed 0
+Key_blocks_unused KEY_BLOCKS_UNUSED
+Key_blocks_used 4
+Key_read_requests 22
+Key_reads 0
+Key_write_requests 26
+Key_writes 6
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default NULL NULL 2097152 1024 4 # 0 22 0 26 6
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+delete from t2 where a='zzzz';
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default NULL NULL 2097152 1024 4 # 0 29 0 32 9
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+delete from t1;
+delete from t2;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default NULL NULL 2097152 1024 4 # 0 29 0 32 9
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+set global key_cache_partitions=2;
+select @@key_cache_partitions;
+@@key_cache_partitions
+2
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 1048576 1024 0 # 0 0 0 0 0
+default 2 2 1048576 1024 0 # 0 0 0 0 0
+default 2 NULL 2097152 1024 0 # 0 0 0 0 0
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+(3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+p a
+1 qqqq
+2 yyyy
+select * from t2;
+p i a
+1 1 qqqq
+2 1 pppp
+3 1 yyyy
+4 3 zzzz
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+show status like 'key_%';
+Variable_name Value
+Key_blocks_not_flushed 0
+Key_blocks_unused KEY_BLOCKS_UNUSED
+Key_blocks_used 4
+Key_read_requests 22
+Key_reads 0
+Key_write_requests 26
+Key_writes 6
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 1048576 1024 3 # 0 10 0 13 4
+default 2 2 1048576 1024 1 # 0 12 0 13 2
+default 2 NULL 2097152 1024 4 # 0 22 0 26 6
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+delete from t1;
+delete from t2;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 1048576 1024 3 # 0 10 0 13 4
+default 2 2 1048576 1024 1 # 0 12 0 13 2
+default 2 NULL 2097152 1024 4 # 0 22 0 26 6
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+set global key_cache_partitions=1;
+select @@key_cache_partitions;
+@@key_cache_partitions
+1
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 1 1 2097152 1024 0 # 0 0 0 0 0
+default 1 NULL 2097152 1024 0 # 0 0 0 0 0
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+(3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+p a
+1 qqqq
+2 yyyy
+select * from t2;
+p i a
+1 1 qqqq
+2 1 pppp
+3 1 yyyy
+4 3 zzzz
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+show status like 'key_%';
+Variable_name Value
+Key_blocks_not_flushed 0
+Key_blocks_unused KEY_BLOCKS_UNUSED
+Key_blocks_used 4
+Key_read_requests 22
+Key_reads 0
+Key_write_requests 26
+Key_writes 6
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 1 1 2097152 1024 4 # 0 22 0 26 6
+default 1 NULL 2097152 1024 4 # 0 22 0 26 6
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+delete from t1;
+delete from t2;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 1 1 2097152 1024 4 # 0 22 0 26 6
+default 1 NULL 2097152 1024 4 # 0 22 0 26 6
+small NULL NULL 1048576 1024 1 # 0 1 0 2 1
+flush tables;
+flush status;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 1 1 2097152 1024 4 # 0 0 0 0 0
+default 1 NULL 2097152 1024 4 # 0 0 0 0 0
+small NULL NULL 1048576 1024 1 # 0 0 0 0 0
+set global key_buffer_size=32*1024;
+select @@key_buffer_size;
+@@key_buffer_size
+32768
+set global key_cache_partitions=2;
+select @@key_cache_partitions;
+@@key_cache_partitions
+2
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 0 # 0 0 0 0 0
+default 2 2 16384 1024 0 # 0 0 0 0 0
+default 2 NULL 32768 1024 0 # 0 0 0 0 0
+small NULL NULL 1048576 1024 1 # 0 0 0 0 0
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+(3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+p a
+1 qqqq
+2 yyyy
+select * from t2;
+p i a
+1 1 qqqq
+2 1 pppp
+3 1 yyyy
+4 3 zzzz
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 1 # 0 12 0 13 2
+default 2 2 16384 1024 3 # 0 10 0 13 4
+default 2 NULL 32768 1024 4 # 0 22 0 26 6
+small NULL NULL 1048576 1024 1 # 0 0 0 0 0
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 1951 # 1976 43
+default 2 2 16384 1024 # # 0 4782 # 1708 60
+default 2 NULL 32768 1024 # # 0 6733 # 3684 103
+small NULL NULL 1048576 1024 # # 0 0 # 0 0
+select * from t1 where p between 1010 and 1020 ;
+p a
+select * from t2 where p between 1010 and 1020 ;
+p i a
+1010 2 pppp
+1011 2 yyyy
+1012 3 zzzz
+1013 2 qqqq
+1014 2 pppp
+1015 2 yyyy
+1016 3 zzzz
+1017 2 qqqq
+1018 2 pppp
+1019 2 yyyy
+1020 3 zzzz
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 1954 # 1976 43
+default 2 2 16384 1024 # # 0 4796 # 1708 60
+default 2 NULL 32768 1024 # # 0 6750 # 3684 103
+small NULL NULL 1048576 1024 # # 0 0 # 0 0
+flush tables;
+flush status;
+update t1 set a='zzzz' where a='qqqq';
+update t2 set i=1 where i=2;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 940 10 939 10
+default 2 2 16384 1024 # # 0 2136 8 613 8
+default 2 NULL 32768 1024 # # 0 3076 18 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_buffer_size=256*1024;
+select @@keycache1.key_buffer_size;
+@@keycache1.key_buffer_size
+262144
+set global keycache1.key_cache_partitions=7;
+select @@keycache1.key_cache_partitions;
+@@keycache1.key_cache_partitions
+7
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 940 10 939 10
+default 2 2 16384 1024 # # 0 2136 8 613 8
+default 2 NULL 32768 1024 # # 0 3076 18 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 7 1 37449 2048 # # 0 0 0 0 0
+keycache1 7 2 37449 2048 # # 0 0 0 0 0
+keycache1 7 3 37449 2048 # # 0 0 0 0 0
+keycache1 7 4 37449 2048 # # 0 0 0 0 0
+keycache1 7 5 37449 2048 # # 0 0 0 0 0
+keycache1 7 6 37449 2048 # # 0 0 0 0 0
+keycache1 7 7 37449 2048 # # 0 0 0 0 0
+keycache1 7 NULL 262143 2048 # # 0 0 0 0 0
+select * from information_schema.key_caches where key_cache_name like "key%";
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+keycache1 7 1 37449 2048 0 # 0 0 0 0 0
+keycache1 7 2 37449 2048 0 # 0 0 0 0 0
+keycache1 7 3 37449 2048 0 # 0 0 0 0 0
+keycache1 7 4 37449 2048 0 # 0 0 0 0 0
+keycache1 7 5 37449 2048 0 # 0 0 0 0 0
+keycache1 7 6 37449 2048 0 # 0 0 0 0 0
+keycache1 7 7 37449 2048 0 # 0 0 0 0 0
+keycache1 7 NULL 262143 2048 0 # 0 0 0 0 0
+cache index t1 key (`primary`) in keycache1;
+Table Op Msg_type Msg_text
+test.t1 assign_to_keycache status OK
+explain select p from t1 where p between 1010 and 1020;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t1 range PRIMARY PRIMARY 4 NULL 1 Using where; Using index
+select p from t1 where p between 1010 and 1020;
+p
+explain select i from t2 where p between 1010 and 1020;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 range PRIMARY PRIMARY 4 NULL 28 Using where
+select i from t2 where p between 1010 and 1020;
+i
+1
+1
+3
+1
+1
+1
+3
+1
+1
+1
+3
+explain select count(*) from t1, t2 where t1.p = t2.i;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 index k1 k1 5 NULL 1024 Using index
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 4 test.t2.i 1 Using index
+select count(*) from t1, t2 where t1.p = t2.i;
+count(*)
+256
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 7 1 37449 2048 # # 0 2 1 0 0
+keycache1 7 2 37449 2048 # # 0 7 1 0 0
+keycache1 7 3 37449 2048 # # 0 0 0 0 0
+keycache1 7 4 37449 2048 # # 0 5 1 0 0
+keycache1 7 5 37449 2048 # # 0 0 0 0 0
+keycache1 7 6 37449 2048 # # 0 0 0 0 0
+keycache1 7 7 37449 2048 # # 0 0 0 0 0
+keycache1 7 NULL 262143 2048 # # 0 14 3 0 0
+select * from information_schema.key_caches where key_cache_name like "key%";
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+keycache1 7 1 37449 2048 1 # 0 2 1 0 0
+keycache1 7 2 37449 2048 1 # 0 7 1 0 0
+keycache1 7 3 37449 2048 0 # 0 0 0 0 0
+keycache1 7 4 37449 2048 1 # 0 5 1 0 0
+keycache1 7 5 37449 2048 0 # 0 0 0 0 0
+keycache1 7 6 37449 2048 0 # 0 0 0 0 0
+keycache1 7 7 37449 2048 0 # 0 0 0 0 0
+keycache1 7 NULL 262143 2048 3 # 0 14 3 0 0
+cache index t2 in keycache1;
+Table Op Msg_type Msg_text
+test.t2 assign_to_keycache status OK
+update t2 set p=p+3000, i=2 where a='qqqq';
+select * from information_schema.key_caches where key_cache_name like "key%";
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+keycache1 7 1 37449 2048 3 # 0 44 3 43 2
+keycache1 7 2 37449 2048 4 # 0 61 4 51 1
+keycache1 7 3 37449 2048 4 # 0 177 4 176 3
+keycache1 7 4 37449 2048 4 # 0 122 4 119 3
+keycache1 7 5 37449 2048 4 # 0 840 4 335 4
+keycache1 7 6 37449 2048 3 # 0 627 3 133 3
+keycache1 7 7 37449 2048 3 # 0 211 3 214 3
+keycache1 7 NULL 262143 2048 25 # 0 2082 25 1071 19
+set global keycache2.key_buffer_size=1024*1024;
+cache index t2 in keycache2;
+Table Op Msg_type Msg_text
+test.t2 assign_to_keycache status OK
+insert into t2 values (2000, 3, 'yyyy');
+select * from information_schema.key_caches where key_cache_name like "keycache2";
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+keycache2 NULL NULL 1048576 1024 0 # 0 0 0 0 0
+select * from information_schema.key_caches where key_cache_name like "key%";
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+keycache1 7 1 37449 2048 3 # 0 44 3 43 2
+keycache1 7 2 37449 2048 4 # 0 61 4 51 1
+keycache1 7 3 37449 2048 4 # 0 177 4 176 3
+keycache1 7 4 37449 2048 4 # 0 122 4 119 3
+keycache1 7 5 37449 2048 4 # 0 840 4 335 4
+keycache1 7 6 37449 2048 3 # 0 627 3 133 3
+keycache1 7 7 37449 2048 3 # 0 211 3 214 3
+keycache1 7 NULL 262143 2048 25 # 0 2082 25 1071 19
+keycache2 NULL NULL 1048576 1024 0 # 0 0 0 0 0
+cache index t2 in keycache1;
+Table Op Msg_type Msg_text
+test.t2 assign_to_keycache status OK
+update t2 set p=p+5000 where a='zzzz';
+select * from t2 where p between 1010 and 1020;
+p i a
+1010 1 pppp
+1011 1 yyyy
+1014 1 pppp
+1015 1 yyyy
+1018 1 pppp
+1019 1 yyyy
+explain select p from t2 where p between 1010 and 1020;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 range PRIMARY PRIMARY 4 NULL 7 Using where; Using index
+select p from t2 where p between 1010 and 1020;
+p
+1010
+1011
+1014
+1015
+1018
+1019
+explain select i from t2 where a='yyyy' and i=3;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ref k1,k2 k1 5 const 188 Using where
+select i from t2 where a='yyyy' and i=3;
+i
+3
+explain select a from t2 where a='yyyy' and i=3;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ref k1,k2 k1 5 const 188 Using where
+select a from t2 where a='yyyy' and i=3 ;
+a
+yyyy
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 7 1 37449 2048 # # 0 85 6 68 3
+keycache1 7 2 37449 2048 # # 0 122 6 102 2
+keycache1 7 3 37449 2048 # # 0 271 8 254 6
+keycache1 7 4 37449 2048 # # 0 179 6 170 4
+keycache1 7 5 37449 2048 # # 0 1445 7 416 6
+keycache1 7 6 37449 2048 # # 0 863 6 345 5
+keycache1 7 7 37449 2048 # # 0 236 4 239 4
+keycache1 7 NULL 262143 2048 # # 0 3201 43 1594 30
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_cache_block_size=2*1024;
+insert into t2 values (7000, 3, 'yyyy');
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 7 1 37449 2048 # # 0 1 1 1 1
+keycache1 7 2 37449 2048 # # 0 1 1 0 0
+keycache1 7 3 37449 2048 # # 0 0 0 0 0
+keycache1 7 4 37449 2048 # # 0 1 1 1 1
+keycache1 7 5 37449 2048 # # 0 1 1 0 0
+keycache1 7 6 37449 2048 # # 0 2 2 1 1
+keycache1 7 7 37449 2048 # # 0 0 0 0 0
+keycache1 7 NULL 262143 2048 # # 0 6 6 3 3
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_cache_block_size=8*1024;
+insert into t2 values (8000, 3, 'yyyy');
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 3 1 87381 8192 # # 0 1 1 1 1
+keycache1 3 2 87381 8192 # # 0 3 2 1 1
+keycache1 3 3 87381 8192 # # 0 2 2 1 1
+keycache1 3 NULL 262143 8192 # # 0 6 5 3 3
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_buffer_size=64*1024;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_cache_block_size=2*1024;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 3 1 21845 2048 # # 0 0 0 0 0
+keycache1 3 2 21845 2048 # # 0 0 0 0 0
+keycache1 3 3 21845 2048 # # 0 0 0 0 0
+keycache1 3 NULL 65535 2048 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_cache_block_size=8*1024;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_buffer_size=0;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_cache_block_size=8*1024;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_buffer_size=0;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_buffer_size=128*1024;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 1 1 131072 8192 # # 0 0 0 0 0
+keycache1 1 NULL 131072 8192 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+set global keycache1.key_cache_block_size=1024;
+select * from information_schema.key_caches;
+KEY_CACHE_NAME PARTITIONS PARTITION_NUMBER FULL_SIZE BLOCK_SIZE USED_BLOCKS UNUSED_BLOCKS DIRTY_BLOCKS READ_REQUESTS READS WRITE_REQUESTS WRITES
+default 2 1 16384 1024 # # 0 966 12 939 10
+default 2 2 16384 1024 # # 0 2206 12 613 8
+default 2 NULL 32768 1024 # # 0 3172 24 1552 18
+small NULL NULL 1048576 1024 # # 0 0 0 0 0
+keycache1 7 1 18724 1024 # # 0 0 0 0 0
+keycache1 7 2 18724 1024 # # 0 0 0 0 0
+keycache1 7 3 18724 1024 # # 0 0 0 0 0
+keycache1 7 4 18724 1024 # # 0 0 0 0 0
+keycache1 7 5 18724 1024 # # 0 0 0 0 0
+keycache1 7 6 18724 1024 # # 0 0 0 0 0
+keycache1 7 7 18724 1024 # # 0 0 0 0 0
+keycache1 7 NULL 131068 1024 # # 0 0 0 0 0
+keycache2 NULL NULL 1048576 1024 # # 0 0 0 0 0
+drop table t1,t2;
+set global keycache1.key_buffer_size=0;
+set global keycache2.key_buffer_size=0;
+set global key_buffer_size=@save_key_buffer_size;
+set global key_cache_partitions=@save_key_cache_partitions;
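
A detail worth noting in the expected output above: the per-partition
FULL_SIZE values are consistent with key_buffer_size being divided by
the number of partitions and rounded down, with the totals row showing
the sum over the partitions. That is why a 256K keycache1 split seven
ways reports 37449 per partition and 262143 in total rather than
262144, and a 128K cache with seven partitions reports 18724 and
131068. The arithmetic can be checked directly:

  SELECT 262144 DIV 7 AS full_size_per_partition,   # 37449, as in the keycache1 rows
         (262144 DIV 7) * 7 AS full_size_total;     # 262143, as in the totals row
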
=== modified file 'mysql-test/t/key_cache.test'
--- a/mysql-test/t/key_cache.test 2008-03-27 16:43:17 +0000
+++ b/mysql-test/t/key_cache.test 2010-06-29 00:10:53 +0000
@@ -1,11 +1,13 @@
#
-# Test of multiple key caches
+# Test of multiple key caches, simple and partitioned
#
--disable_warnings
drop table if exists t1, t2, t3;
--enable_warnings
-SET @save_key_buffer=@@key_buffer_size;
+SET @save_key_buffer_size=@@key_buffer_size;
+SET @save_key_cache_block_size=@@key_cache_block_size;
+SET @save_key_cache_partitions=@@key_cache_partitions;
SELECT @@key_buffer_size, @@small.key_buffer_size;
@@ -33,7 +35,7 @@ SELECT @@`default`.key_buffer_size;
SELECT @@small.key_buffer_size;
SELECT @@medium.key_buffer_size;
-SET @@global.key_buffer_size=@save_key_buffer;
+SET @@global.key_buffer_size=@save_key_buffer_size;
#
# Errors
@@ -247,3 +249,263 @@ SET GLOBAL key_cache_block_size= @bug284
DROP TABLE t1;
# End of 4.1 tests
+
+#
+# Test cases for partitioned key caches
+#
+
+# Test usage of the KEY_CACHES table from the information schema
+# for a simple key cache
+
+set global key_buffer_size=@save_key_buffer_size;
+set global key_cache_block_size=@save_key_cache_block_size;
+select @@key_buffer_size;
+select @@key_cache_block_size;
+select @@key_cache_partitions;
+
+create table t1 (
+ p int not null auto_increment primary key,
+ a char(10));
+create table t2 (
+ p int not null auto_increment primary key,
+ i int, a char(10), key k1(i), key k2(a));
+
+select @@key_cache_partitions;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+ (3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+select * from t2;
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+
+--replace_result 1808 KEY_BLOCKS_UNUSED 1670 KEY_BLOCKS_UNUSED
+show status like 'key_%';
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+delete from t2 where a='zzzz';
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+delete from t1;
+delete from t2;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+# For the key cache with 2 partitions execute the same sequence of
+# statements as for the simple cache above.
+# The statistical information on the number of i/o requests and
+# the number of i/o operations performed is expected to be the same.
+
+set global key_cache_partitions=2;
+select @@key_cache_partitions;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+ (3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+select * from t2;
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+
+--replace_result 1808 KEY_BLOCKS_UNUSED 1670 KEY_BLOCKS_UNUSED
+show status like 'key_%';
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+delete from t1;
+delete from t2;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+# Check that we can work with one partition with the same results
+
+set global key_cache_partitions=1;
+select @@key_cache_partitions;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+ (3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+select * from t2;
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+
+--replace_result 1808 KEY_BLOCKS_UNUSED 1670 KEY_BLOCKS_UNUSED
+show status like 'key_%';
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+delete from t1;
+delete from t2;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+flush tables; flush status;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+# Switch back to 2 partitions
+
+set global key_buffer_size=32*1024;
+select @@key_buffer_size;
+set global key_cache_partitions=2;
+select @@key_cache_partitions;
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+insert into t1 values (1, 'qqqq'), (2, 'yyyy');
+insert into t2 values (1, 1, 'qqqq'), (2, 1, 'pppp'),
+ (3, 1, 'yyyy'), (4, 3, 'zzzz');
+select * from t1;
+select * from t2;
+update t1 set p=3 where p=1;
+update t2 set i=2 where i=1;
+
+--replace_column 7 #
+select * from information_schema.key_caches;
+
+# Add more rows to tables t1 and t2
+
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+insert into t1(a) select a from t1;
+
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+insert into t2(i,a) select i,a from t2;
+
+--replace_column 6 # 7 # 10 #
+select * from information_schema.key_caches;
+
+select * from t1 where p between 1010 and 1020 ;
+select * from t2 where p between 1010 and 1020 ;
+--replace_column 6 # 7 # 10 #
+select * from information_schema.key_caches;
+
+flush tables; flush status;
+update t1 set a='zzzz' where a='qqqq';
+update t2 set i=1 where i=2;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+# Now test how we can work with 7 partitions
+
+set global keycache1.key_buffer_size=256*1024;
+select @@keycache1.key_buffer_size;
+set global keycache1.key_cache_partitions=7;
+select @@keycache1.key_cache_partitions;
+
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+--replace_column 7 #
+select * from information_schema.key_caches where key_cache_name like "key%";
+
+cache index t1 key (`primary`) in keycache1;
+
+explain select p from t1 where p between 1010 and 1020;
+select p from t1 where p between 1010 and 1020;
+explain select i from t2 where p between 1010 and 1020;
+select i from t2 where p between 1010 and 1020;
+explain select count(*) from t1, t2 where t1.p = t2.i;
+select count(*) from t1, t2 where t1.p = t2.i;
+
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+--replace_column 7 #
+select * from information_schema.key_caches where key_cache_name like "key%";
+
+cache index t2 in keycache1;
+update t2 set p=p+3000, i=2 where a='qqqq';
+--replace_column 7 #
+select * from information_schema.key_caches where key_cache_name like "key%";
+
+set global keycache2.key_buffer_size=1024*1024;
+cache index t2 in keycache2;
+insert into t2 values (2000, 3, 'yyyy');
+--replace_column 7 #
+select * from information_schema.key_caches where key_cache_name like "keycache2";
+--replace_column 7 #
+select * from information_schema.key_caches where key_cache_name like "key%";
+
+cache index t2 in keycache1;
+update t2 set p=p+5000 where a='zzzz';
+select * from t2 where p between 1010 and 1020;
+explain select p from t2 where p between 1010 and 1020;
+select p from t2 where p between 1010 and 1020;
+explain select i from t2 where a='yyyy' and i=3;
+select i from t2 where a='yyyy' and i=3;
+explain select a from t2 where a='yyyy' and i=3;
+select a from t2 where a='yyyy' and i=3 ;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_cache_block_size=2*1024;
+insert into t2 values (7000, 3, 'yyyy');
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_cache_block_size=8*1024;
+insert into t2 values (8000, 3, 'yyyy');
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_buffer_size=64*1024;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_cache_block_size=2*1024;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_cache_block_size=8*1024;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_buffer_size=0;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_cache_block_size=8*1024;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_buffer_size=0;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_buffer_size=128*1024;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+set global keycache1.key_cache_block_size=1024;
+--replace_column 6 # 7 #
+select * from information_schema.key_caches;
+
+drop table t1,t2;
+
+set global keycache1.key_buffer_size=0;
+set global keycache2.key_buffer_size=0;
+
+set global key_buffer_size=@save_key_buffer_size;
+set global key_cache_partitions=@save_key_cache_partitions;
+
+#End of 5.1 tests
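
Condensing the scenario above, the core statements for placing a
table's indexes into a partitioned named key cache are few; a minimal
sketch reusing the names from this test (keycache1, t1, t2):

  SET GLOBAL keycache1.key_buffer_size = 256*1024;
  SET GLOBAL keycache1.key_cache_partitions = 7;
  CACHE INDEX t1 KEY (`primary`) IN keycache1;   # assign only t1's primary key
  CACHE INDEX t2 IN keycache1;                   # assign all indexes of t2
  SELECT * FROM information_schema.KEY_CACHES
  WHERE KEY_CACHE_NAME LIKE 'key%';
  SET GLOBAL keycache1.key_buffer_size = 0;      # destroy the cache when done
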
=== modified file 'mysys/mf_keycache.c'
--- a/mysys/mf_keycache.c 2009-09-07 20:50:10 +0000
+++ b/mysys/mf_keycache.c 2010-06-29 00:10:53 +0000
@@ -13,6 +13,35 @@
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+
+/******************************************************************************
+ The file contains the following modules:
+
+ Simple Key Cache Module
+
+ Partitioned Key Cache Module
+
+ Key Cache Interface Module
+
+******************************************************************************/
+
+#include "mysys_priv.h"
+#include "mysys_err.h"
+#include <keycache.h>
+#include "my_static.h"
+#include <m_string.h>
+#include <my_bit.h>
+#include <errno.h>
+#include <stdarg.h>
+
+/******************************************************************************
+ Simple Key Cache Module
+
+ The module contains implementations of all key cache interface functions
+ employed by partitioned key caches.
+
+******************************************************************************/
+
/*
These functions handle keyblock cacheing for ISAM and MyISAM tables.
@@ -101,14 +130,77 @@
I/O finished.
*/
-#include "mysys_priv.h"
-#include "mysys_err.h"
-#include <keycache.h>
-#include "my_static.h"
-#include <m_string.h>
-#include <my_bit.h>
-#include <errno.h>
-#include <stdarg.h>
+/* declare structures that are used by st_key_cache */
+
+struct st_block_link;
+typedef struct st_block_link BLOCK_LINK;
+struct st_keycache_page;
+typedef struct st_keycache_page KEYCACHE_PAGE;
+struct st_hash_link;
+typedef struct st_hash_link HASH_LINK;
+
+/* info about requests in a waiting queue */
+typedef struct st_keycache_wqueue
+{
+ struct st_my_thread_var *last_thread; /* circular list of waiting threads */
+} KEYCACHE_WQUEUE;
+
+#define CHANGED_BLOCKS_HASH 128 /* must be power of 2 */
+
+/* Control block for a simple (non-partitioned) key cache */
+
+typedef struct st_s_key_cache_cb
+{
+ my_bool key_cache_inited; /* <=> control block is allocated */
+ my_bool in_resize; /* true during resize operation */
+ my_bool resize_in_flush; /* true during flush of resize operation */
+ my_bool can_be_used; /* usage of cache for read/write is allowed */
+ size_t key_cache_mem_size; /* specified size of the cache memory */
+ uint key_cache_block_size; /* size of the page buffer of a cache block */
+ ulong min_warm_blocks; /* min number of warm blocks; */
+ ulong age_threshold; /* age threshold for hot blocks */
+ ulonglong keycache_time; /* total number of block link operations */
+ uint hash_entries; /* max number of entries in the hash table */
+ int hash_links; /* max number of hash links */
+ int hash_links_used; /* number of hash links currently used */
+ int disk_blocks; /* max number of blocks in the cache */
+ ulong blocks_used; /* maximum number of concurrently used blocks */
+ ulong blocks_unused; /* number of currently unused blocks */
+ ulong blocks_changed; /* number of currently dirty blocks */
+ ulong warm_blocks; /* number of blocks in warm sub-chain */
+ ulong cnt_for_resize_op; /* counter to block resize operation */
+ long blocks_available; /* number of blocks available in the LRU chain */
+ HASH_LINK **hash_root; /* arr. of entries into hash table buckets */
+ HASH_LINK *hash_link_root; /* memory for hash table links */
+ HASH_LINK *free_hash_list; /* list of free hash links */
+ BLOCK_LINK *free_block_list; /* list of free blocks */
+ BLOCK_LINK *block_root; /* memory for block links */
+ uchar HUGE_PTR *block_mem; /* memory for block buffers */
+ BLOCK_LINK *used_last; /* ptr to the last block of the LRU chain */
+ BLOCK_LINK *used_ins; /* ptr to the insertion block in LRU chain */
+ pthread_mutex_t cache_lock; /* to lock access to the cache structure */
+ KEYCACHE_WQUEUE resize_queue; /* threads waiting during resize operation */
+ /*
+ Waiting for a zero resize count. Using a queue for symmetry though
+ only one thread can wait here.
+ */
+ KEYCACHE_WQUEUE waiting_for_resize_cnt;
+ KEYCACHE_WQUEUE waiting_for_hash_link; /* waiting for a free hash link */
+ KEYCACHE_WQUEUE waiting_for_block; /* requests waiting for a free block */
+ BLOCK_LINK *changed_blocks[CHANGED_BLOCKS_HASH]; /* hash for dirty file bl.*/
+ BLOCK_LINK *file_blocks[CHANGED_BLOCKS_HASH]; /* hash for other file bl.*/
+
+ /* Statistics variables. These are reset in reset_key_cache_counters(). */
+ ulong global_blocks_changed; /* number of currently dirty blocks */
+ ulonglong global_cache_w_requests;/* number of write requests (write hits) */
+ ulonglong global_cache_write; /* number of writes from cache to files */
+ ulonglong global_cache_r_requests;/* number of read requests (read hits) */
+ ulonglong global_cache_read; /* number of reads from files to cache */
+
+ int blocks; /* max number of blocks in the cache */
+ uint hash_factor; /* factor used to calculate hash function */
+ my_bool in_init; /* Set to 1 in MySQL during init/resize */
+} S_KEY_CACHE_CB;
/*
Some compilation flags have been added specifically for this module
@@ -220,7 +312,12 @@ KEY_CACHE *dflt_key_cache= &dflt_key_cac
#define FLUSH_CACHE 2000 /* sort this many blocks at once */
-static int flush_all_key_blocks(KEY_CACHE *keycache);
+static int flush_all_key_blocks(S_KEY_CACHE_CB *keycache);
+/*
+static void s_change_key_cache_param(void *keycache_cb, uint division_limit,
+ uint age_threshold);
+*/
+static void s_end_key_cache(void *keycache_cb, my_bool cleanup);
#ifdef THREAD
static void wait_on_queue(KEYCACHE_WQUEUE *wqueue,
pthread_mutex_t *mutex);
@@ -229,15 +326,16 @@ static void release_whole_queue(KEYCACHE
#define wait_on_queue(wqueue, mutex) do {} while (0)
#define release_whole_queue(wqueue) do {} while (0)
#endif
-static void free_block(KEY_CACHE *keycache, BLOCK_LINK *block);
+static void free_block(S_KEY_CACHE_CB *keycache, BLOCK_LINK *block);
#if !defined(DBUG_OFF)
-static void test_key_cache(KEY_CACHE *keycache,
+static void test_key_cache(S_KEY_CACHE_CB *keycache,
const char *where, my_bool lock);
#endif
-
+#define KEYCACHE_BASE_EXPR(f, pos) \
+ ((ulong) ((pos) / keycache->key_cache_block_size) + (ulong) (f))
#define KEYCACHE_HASH(f, pos) \
-(((ulong) ((pos) / keycache->key_cache_block_size) + \
- (ulong) (f)) & (keycache->hash_entries-1))
+ ((KEYCACHE_BASE_EXPR(f, pos) / keycache->hash_factor) & \
+ (keycache->hash_entries-1))
#define FILE_HASH(f) ((uint) (f) & (CHANGED_BLOCKS_HASH-1))
#define DEFAULT_KEYCACHE_DEBUG_LOG "keycache_debug.log"
@@ -333,9 +431,10 @@ static int keycache_pthread_cond_signal(
#define inline /* disabled inline for easier debugging */
static int fail_block(BLOCK_LINK *block);
static int fail_hlink(HASH_LINK *hlink);
-static int cache_empty(KEY_CACHE *keycache);
+static int cache_empty(S_KEY_CACHE_CB *keycache);
#endif
+
static inline uint next_power(uint value)
{
return (uint) my_round_up_to_next_power((uint32) value) << 1;
@@ -343,19 +442,32 @@ static inline uint next_power(uint value
/*
- Initialize a key cache
+ Initialize a simple key cache
SYNOPSIS
- init_key_cache()
- keycache pointer to a key cache data structure
- key_cache_block_size size of blocks to keep cached data
- use_mem total memory to use for the key cache
- division_limit division limit (may be zero)
- age_threshold age threshold (may be zero)
+ s_init_key_cache()
+ keycache_cb pointer to the control block of a simple key cache
+ key_cache_block_size size of blocks to keep cached data
+ use_mem memory to use for the key cache buffers/structures
+ division_limit division limit (may be zero)
+ age_threshold age threshold (may be zero)
+
+ DESCRIPTION
+ This function is the implementation of the init_key_cache interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function builds a simple key cache and initializes the control block
+ structure of the type S_KEY_CACHE_CB that is used for this key cache.
+ The parameter keycache_cb is supposed to point to this structure.
+ The parameter key_cache_block_size specifies the size of the blocks in
+ the key cache to be built. The parameters division_limit and age_threshold
+ determine the initial values of those characteristics of the key cache
+ that are used for midpoint insertion strategy. The parameter use_mem
+ specifies the total amount of memory to be allocated for key cache blocks
+ and auxiliary structures.
RETURN VALUE
number of blocks in the key cache, if successful,
- 0 - otherwise.
+ <= 0 - otherwise.
NOTES.
if keycache->key_cache_inited != 0 we assume that the key cache
@@ -367,10 +479,12 @@ static inline uint next_power(uint value
*/
-int init_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
- size_t use_mem, uint division_limit,
- uint age_threshold)
+static
+int s_init_key_cache(void *keycache_cb, uint key_cache_block_size,
+ size_t use_mem, uint division_limit,
+ uint age_threshold)
{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
ulong blocks, hash_links;
size_t length;
int error;
@@ -384,12 +498,15 @@ int init_key_cache(KEY_CACHE *keycache,
DBUG_RETURN(0);
}
+ keycache->blocks_used= keycache->blocks_unused= 0;
+ keycache->global_blocks_changed= 0;
keycache->global_cache_w_requests= keycache->global_cache_r_requests= 0;
keycache->global_cache_read= keycache->global_cache_write= 0;
keycache->disk_blocks= -1;
if (! keycache->key_cache_inited)
{
keycache->key_cache_inited= 1;
+ keycache->hash_factor= 1;
/*
Initialize these variables once only.
Their value must survive re-initialization during resizing.
@@ -531,51 +648,43 @@ err:
/*
- Resize a key cache
+ Prepare for resizing a simple key cache
SYNOPSIS
- resize_key_cache()
- keycache pointer to a key cache data structure
- key_cache_block_size size of blocks to keep cached data
- use_mem total memory to use for the new key cache
- division_limit new division limit (if not zero)
- age_threshold new age threshold (if not zero)
+ s_prepare_resize_key_cache()
+ keycache_cb pointer to the control block of a simple key cache
+ with_resize_queue <=> resize queue is used
+ release_lock <=> release the key cache lock before return
- RETURN VALUE
- number of blocks in the key cache, if successful,
- 0 - otherwise.
+ DESCRIPTION
+ This function flushes all dirty pages from a simple key cache and then
+ destroys the key cache by calling s_end_key_cache. The function
+ considers the parameter keycache_cb as a pointer to the control block
+ structure of the type S_KEY_CACHE_CB for this key cache.
+ The parameter with_resize_queue determines whether the resize queue is
+ involved (MySQL server never uses this queue). The parameter release_lock
+ says whether the key cache lock must be released before return from
+ the function.
- NOTES.
- The function first compares the memory size and the block size parameters
- with the key cache values.
+ RETURN VALUE
+ 0 - on success,
+ 1 - otherwise.
- If they differ the function free the the memory allocated for the
- old key cache blocks by calling the end_key_cache function and
- then rebuilds the key cache with new blocks by calling
- init_key_cache.
+ NOTES
+ This function is called by s_resize_key_cache and p_resize_key_cache
+ that resize simple and partitioned key caches respectively.
- The function starts the operation only when all other threads
- performing operations with the key cache let her to proceed
- (when cnt_for_resize=0).
*/
-int resize_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
- size_t use_mem, uint division_limit,
- uint age_threshold)
+static
+int s_prepare_resize_key_cache(void *keycache_cb,
+ my_bool with_resize_queue,
+ my_bool release_lock)
{
- int blocks;
- DBUG_ENTER("resize_key_cache");
-
- if (!keycache->key_cache_inited)
- DBUG_RETURN(keycache->disk_blocks);
-
- if(key_cache_block_size == keycache->key_cache_block_size &&
- use_mem == keycache->key_cache_mem_size)
- {
- change_key_cache_param(keycache, division_limit, age_threshold);
- DBUG_RETURN(keycache->disk_blocks);
- }
-
+ int res= 0;
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ DBUG_ENTER("s_prepare_resize_key_cache");
+
keycache_pthread_mutex_lock(&keycache->cache_lock);
#ifdef THREAD
@@ -585,7 +694,7 @@ int resize_key_cache(KEY_CACHE *keycache
one resizer only. In set_var.cc keycache->in_init is used to block
multiple attempts.
*/
- while (keycache->in_resize)
+ while (with_resize_queue && keycache->in_resize)
{
/* purecov: begin inspected */
wait_on_queue(&keycache->resize_queue, &keycache->cache_lock);
@@ -610,8 +719,8 @@ int resize_key_cache(KEY_CACHE *keycache
{
/* TODO: if this happens, we should write a warning in the log file ! */
keycache->resize_in_flush= 0;
- blocks= 0;
keycache->can_be_used= 0;
+ res= 1;
goto finish;
}
DBUG_ASSERT(cache_empty(keycache));
@@ -637,29 +746,145 @@ int resize_key_cache(KEY_CACHE *keycache
#else
KEYCACHE_DBUG_ASSERT(keycache->cnt_for_resize_op == 0);
#endif
-
- /*
- Free old cache structures, allocate new structures, and initialize
- them. Note that the cache_lock mutex and the resize_queue are left
- untouched. We do not lose the cache_lock and will release it only at
- the end of this function.
- */
- end_key_cache(keycache, 0); /* Don't free mutex */
- /* The following will work even if use_mem is 0 */
- blocks= init_key_cache(keycache, key_cache_block_size, use_mem,
- division_limit, age_threshold);
+
+ s_end_key_cache(keycache_cb, 0);
finish:
+ if (release_lock)
+ keycache_pthread_mutex_unlock(&keycache->cache_lock);
+ DBUG_RETURN(res);
+}
+
+
+/*
+ Finalize resizing a simple key cache
+
+ SYNOPSIS
+ s_finish_resize_key_cache()
+ keycache_cb pointer to the control block of a simple key cache
+ with_resize_queue <=> resize queue is used
+ acquire_lock <=> acquire the key cache lock at start
+
+ DESCRIPTION
+ This function performs finalizing actions for the operation of
+ resizing a simple key cache. The function considers the parameter
+ keycache_cb as a pointer to the control block structure of the type
+ S_KEY_CACHE_CB for this key cache. The function sets the flag
+ in_resize in this structure to FALSE.
+ The parameter with_resize_queue determines whether the resize queue
+ is involved (MySQL server never uses this queue).
+ The parameter acquire_lock says whether the key cache lock must be
+ acquired at the start of the function.
+
+ RETURN VALUE
+ none
+
+ NOTES
+ This function is called by s_resize_key_cache and p_resize_key_cache
+ that resize simple and partitioned key caches respectively.
+
+*/
+
+static
+void s_finish_resize_key_cache(void *keycache_cb,
+ my_bool with_resize_queue,
+ my_bool acquire_lock)
+{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ DBUG_ENTER("s_finish_resize_key_cache");
+
+ if (acquire_lock)
+ keycache_pthread_mutex_lock(&keycache->cache_lock);
+
/*
Mark the resize finished. This allows other threads to start a
resize or to request new cache blocks.
*/
keycache->in_resize= 0;
-
- /* Signal waiting threads. */
- release_whole_queue(&keycache->resize_queue);
+
+ if (with_resize_queue)
+ {
+ /* Signal waiting threads. */
+ release_whole_queue(&keycache->resize_queue);
+ }
keycache_pthread_mutex_unlock(&keycache->cache_lock);
+
+ DBUG_VOID_RETURN;
+}
+
+
+/*
+ Resize a simple key cache
+
+ SYNOPSIS
+ s_resize_key_cache()
+ keycache_cb pointer to the control block of a simple key cache
+ key_cache_block_size size of blocks to keep cached data
+ use_mem memory to use for the key cache buffers/structures
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+
+ DESCRIPTION
+ This function is the implementation of the resize_key_cache interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for the simple key
+ cache to be resized.
+ The parameter key_cache_block_size specifies the new size of the blocks in
+ the key cache. The parameters division_limit and age_threshold
+ determine the new initial values of those characteristics of the key cache
+ that are used for midpoint insertion strategy. The parameter use_mem
+ specifies the total amount of memory to be allocated for key cache blocks
+ and auxiliary structures in the new key cache.
+
+ RETURN VALUE
+ number of blocks in the key cache, if successful,
+ 0 - otherwise.
+
+ NOTES.
+ The function first calls s_prepare_resize_key_cache
+ to flush all dirty blocks from the key cache and to free the memory
+ used for key cache blocks and auxiliary structures. After this the
+ function builds a new key cache with new parameters.
+
+ This implementation doesn't block the calls and executions of other
+ functions from the key cache interface. However it assumes that the
+ calls of s_resize_key_cache itself are serialized.
+
+ The function starts the operation only when all other threads
+ performing operations with the key cache let it proceed
+ (when cnt_for_resize=0).
+
+*/
+
+static
+int s_resize_key_cache(void *keycache_cb, uint key_cache_block_size,
+ size_t use_mem, uint division_limit,
+ uint age_threshold)
+{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ int blocks= 0;
+ DBUG_ENTER("s_resize_key_cache");
+
+ if (!keycache->key_cache_inited)
+ DBUG_RETURN(keycache->disk_blocks);
+
+ /*
+ Note that the cache_lock mutex and the resize_queue are left untouched.
+ We do not lose the cache_lock and will release it only at the end of
+ this function.
+ */
+ if (s_prepare_resize_key_cache(keycache_cb, 1, 0))
+ goto finish;
+
+ /* The following will work even if use_mem is 0 */
+ blocks= s_init_key_cache(keycache, key_cache_block_size, use_mem,
+ division_limit, age_threshold);
+
+finish:
+ s_finish_resize_key_cache(keycache_cb, 1, 0);
+
DBUG_RETURN(blocks);
}
@@ -667,7 +892,7 @@ finish:
/*
Increment counter blocking resize key cache operation
*/
-static inline void inc_counter_for_resize_op(KEY_CACHE *keycache)
+static inline void inc_counter_for_resize_op(S_KEY_CACHE_CB *keycache)
{
keycache->cnt_for_resize_op++;
}
@@ -677,35 +902,49 @@ static inline void inc_counter_for_resiz
Decrement counter blocking resize key cache operation;
Signal the operation to proceed when counter becomes equal zero
*/
-static inline void dec_counter_for_resize_op(KEY_CACHE *keycache)
+static inline void dec_counter_for_resize_op(S_KEY_CACHE_CB *keycache)
{
if (!--keycache->cnt_for_resize_op)
release_whole_queue(&keycache->waiting_for_resize_cnt);
}
+
/*
- Change the key cache parameters
+ Change key cache parameters of a simple key cache
SYNOPSIS
- change_key_cache_param()
- keycache pointer to a key cache data structure
- division_limit new division limit (if not zero)
- age_threshold new age threshold (if not zero)
+ s_change_key_cache_param()
+ keycache_cb pointer to the control block of a simple key cache
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+
+ DESCRIPTION
+ This function is the implementation of the change_key_cache_param interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for the simple key
+ cache where new values of the division limit and the age threshold used
+ for midpoint insertion strategy are to be set. The parameters
+ division_limit and age_threshold provide these new values.
RETURN VALUE
none
NOTES.
- Presently the function resets the key cache parameters
- concerning midpoint insertion strategy - division_limit and
- age_threshold.
+ Presently the function resets the key cache parameters concerning
+ midpoint insertion strategy - division_limit and age_threshold.
+ This function changes some parameters of a given key cache without
+ reformatting it. The function does not touch the contents of the key
+ cache blocks.
+
*/
-void change_key_cache_param(KEY_CACHE *keycache, uint division_limit,
- uint age_threshold)
+static
+void s_change_key_cache_param(void *keycache_cb, uint division_limit,
+ uint age_threshold)
{
- DBUG_ENTER("change_key_cache_param");
-
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ DBUG_ENTER("s_change_key_cache_param");
keycache_pthread_mutex_lock(&keycache->cache_lock);
if (division_limit)
keycache->min_warm_blocks= (keycache->disk_blocks *
@@ -719,20 +958,32 @@ void change_key_cache_param(KEY_CACHE *k
/*
- Remove key_cache from memory
+ Destroy a simple key cache
SYNOPSIS
- end_key_cache()
- keycache key cache handle
- cleanup Complete free (Free also mutex for key cache)
+ s_end_key_cache()
+ keycache_cb pointer to the control block of a simple key cache
+ cleanup <=> complete free (free also mutex for key cache)
+
+ DESCRIPTION
+ This function is the implementation of the end_key_cache interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for the simple key
+ cache to be destroyed.
+ The function frees the memory allocated for the key cache blocks and
+ auxiliary structures. If the value of the parameter cleanup is TRUE
+ then even the key cache mutex is freed.
RETURN VALUE
none
*/
-void end_key_cache(KEY_CACHE *keycache, my_bool cleanup)
+static
+void s_end_key_cache(void *keycache_cb, my_bool cleanup)
{
- DBUG_ENTER("end_key_cache");
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ DBUG_ENTER("s_end_key_cache");
DBUG_PRINT("enter", ("key_cache: 0x%lx", (long) keycache));
if (!keycache->key_cache_inited)
@@ -760,7 +1011,14 @@ void end_key_cache(KEY_CACHE *keycache,
(ulong) keycache->global_cache_r_requests,
(ulong) keycache->global_cache_read));
- if (cleanup)
+ /*
+ Reset these values to be able to detect a disabled key cache.
+ See Bug#44068 (RESTORE can disable the MyISAM Key Cache).
+ */
+ keycache->blocks_used= 0;
+ keycache->blocks_unused= 0;
+
+ if (cleanup)
{
pthread_mutex_destroy(&keycache->cache_lock);
keycache->key_cache_inited= keycache->can_be_used= 0;
@@ -1016,7 +1274,7 @@ static inline void link_changed(BLOCK_LI
void
*/
-static void link_to_file_list(KEY_CACHE *keycache,
+static void link_to_file_list(S_KEY_CACHE_CB *keycache,
BLOCK_LINK *block, int file,
my_bool unlink_block)
{
@@ -1057,7 +1315,7 @@ static void link_to_file_list(KEY_CACHE
void
*/
-static void link_to_changed_list(KEY_CACHE *keycache,
+static void link_to_changed_list(S_KEY_CACHE_CB *keycache,
BLOCK_LINK *block)
{
DBUG_ASSERT(block->status & BLOCK_IN_USE);
@@ -1112,7 +1370,7 @@ static void link_to_changed_list(KEY_CAC
not linked in the LRU ring.
*/
-static void link_block(KEY_CACHE *keycache, BLOCK_LINK *block, my_bool hot,
+static void link_block(S_KEY_CACHE_CB *keycache, BLOCK_LINK *block, my_bool hot,
my_bool at_end)
{
BLOCK_LINK *ins;
@@ -1233,7 +1491,7 @@ static void link_block(KEY_CACHE *keycac
See NOTES for link_block
*/
-static void unlink_block(KEY_CACHE *keycache, BLOCK_LINK *block)
+static void unlink_block(S_KEY_CACHE_CB *keycache, BLOCK_LINK *block)
{
DBUG_ASSERT((block->status & ~BLOCK_CHANGED) == (BLOCK_READ | BLOCK_IN_USE));
DBUG_ASSERT(block->hash_link); /*backptr to block NULL from free_block()*/
@@ -1291,7 +1549,7 @@ static void unlink_block(KEY_CACHE *keyc
RETURN
void
*/
-static void reg_requests(KEY_CACHE *keycache, BLOCK_LINK *block, int count)
+static void reg_requests(S_KEY_CACHE_CB *keycache, BLOCK_LINK *block, int count)
{
DBUG_ASSERT(block->status & BLOCK_IN_USE);
DBUG_ASSERT(block->hash_link);
@@ -1334,7 +1592,7 @@ static void reg_requests(KEY_CACHE *keyc
not linked in the LRU ring.
*/
-static void unreg_request(KEY_CACHE *keycache,
+static void unreg_request(S_KEY_CACHE_CB *keycache,
BLOCK_LINK *block, int at_end)
{
DBUG_ASSERT(block->status & (BLOCK_READ | BLOCK_IN_USE));
@@ -1343,7 +1601,11 @@ static void unreg_request(KEY_CACHE *key
DBUG_ASSERT(block->prev_changed && *block->prev_changed == block);
DBUG_ASSERT(!block->next_used);
DBUG_ASSERT(!block->prev_used);
- if (! --block->requests)
+ /*
+ Unregister the request, but do not link erroneous blocks into the
+ LRU ring.
+ */
+ if (!--block->requests && !(block->status & BLOCK_ERROR))
{
my_bool hot;
if (block->hits_left)
@@ -1419,7 +1681,7 @@ static void remove_reader(BLOCK_LINK *bl
signals on its termination
*/
-static void wait_for_readers(KEY_CACHE *keycache,
+static void wait_for_readers(S_KEY_CACHE_CB *keycache,
BLOCK_LINK *block)
{
#ifdef THREAD
@@ -1469,7 +1731,7 @@ static inline void link_hash(HASH_LINK *
Remove a hash link from the hash table
*/
-static void unlink_hash(KEY_CACHE *keycache, HASH_LINK *hash_link)
+static void unlink_hash(S_KEY_CACHE_CB *keycache, HASH_LINK *hash_link)
{
KEYCACHE_DBUG_PRINT("unlink_hash", ("fd: %u pos_ %lu #requests=%u",
(uint) hash_link->file,(ulong) hash_link->diskpos, hash_link->requests));
@@ -1525,7 +1787,7 @@ static void unlink_hash(KEY_CACHE *keyca
Get the hash link for a page
*/
-static HASH_LINK *get_hash_link(KEY_CACHE *keycache,
+static HASH_LINK *get_hash_link(S_KEY_CACHE_CB *keycache,
int file, my_off_t filepos)
{
reg1 HASH_LINK *hash_link, **start;
@@ -1646,7 +1908,7 @@ restart:
waits until first of this operations links any block back.
*/
-static BLOCK_LINK *find_key_block(KEY_CACHE *keycache,
+static BLOCK_LINK *find_key_block(S_KEY_CACHE_CB *keycache,
File file, my_off_t filepos,
int init_hits_left,
int wrmode, int *page_st)
@@ -1716,6 +1978,7 @@ restart:
- block assigned but not yet read from file (invalid data).
*/
+#ifdef THREAD
if (keycache->in_resize)
{
/* This is a request during a resize operation */
@@ -1957,6 +2220,9 @@ restart:
}
DBUG_RETURN(0);
}
+#else /* THREAD */
+ DBUG_ASSERT(!keycache->in_resize);
+#endif
if (page_status == PAGE_READ &&
(block->status & (BLOCK_IN_EVICTION | BLOCK_IN_SWITCH |
@@ -2210,9 +2476,9 @@ restart:
thread might change the block->hash_link value
*/
error= my_pwrite(block->hash_link->file,
- block->buffer+block->offset,
+ block->buffer + block->offset,
block->length - block->offset,
- block->hash_link->diskpos+ block->offset,
+ block->hash_link->diskpos + block->offset,
MYF(MY_NABP | MY_WAIT_IF_FULL));
keycache_pthread_mutex_lock(&keycache->cache_lock);
@@ -2402,7 +2668,7 @@ restart:
portion is less than read_length, but not less than min_length.
*/
-static void read_block(KEY_CACHE *keycache,
+static void read_block(S_KEY_CACHE_CB *keycache,
BLOCK_LINK *block, uint read_length,
uint min_length, my_bool primary)
{
@@ -2490,43 +2756,62 @@ static void read_block(KEY_CACHE *keycac
/*
- Read a block of data from a cached file into a buffer;
+ Read a block of data from a simple key cache into a buffer
SYNOPSIS
- key_cache_read()
- keycache pointer to a key cache data structure
- file handler for the file for the block of data to be read
- filepos position of the block of data in the file
- level determines the weight of the data
- buff buffer to where the data must be placed
- length length of the buffer
- block_length length of the block in the key cache buffer
- return_buffer return pointer to the key cache buffer with the data
+ s_key_cache_read()
+ keycache_cb pointer to the control block of a simple key cache
+ file handler for the file for the block of data to be read
+ filepos position of the block of data in the file
+ level determines the weight of the data
+ buff buffer to where the data must be placed
+ length length of the buffer
+ block_length length of the read data from a key cache block
+ return_buffer return pointer to the key cache buffer with the data
+ DESCRIPTION
+ This function is the implementation of the key_cache_read interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key
+ cache.
+ In a general case the function reads a block of data from the key cache
+ into the buffer buff of the size specified by the parameter length. The
+ beginning of the block of data to be read is specified by the parameters
+ file and filepos. The length of the read data is the same as the length
+ of the buffer. The data is read into the buffer in key_cache_block_size
+ increments. If the next portion of the data is not found in any key cache
+ block, first it is read from file into the key cache.
+ If the parameter return_buffer is not ignored and its value is TRUE, and
+ the data to be read of the specified size block_length can be read from one
+ key cache buffer, then the function returns a pointer to the data in the
+ key cache buffer.
+ The function takes into account the parameters block_length and
+ return_buffer only in a single-threaded environment.
+ The parameter 'level' is used only by the midpoint insertion strategy
+ when the data or its portion cannot be found in the key cache.
+
RETURN VALUE
- Returns address from where the data is placed if sucessful, 0 - otherwise.
+ Returns address from where the data is placed if successful, 0 - otherwise.
- NOTES.
- The function ensures that a block of data of size length from file
- positioned at filepos is in the buffers for some key cache blocks.
- Then the function either copies the data into the buffer buff, or,
- if return_buffer is TRUE, it just returns the pointer to the key cache
- buffer with the data.
+ NOTES
Filepos must be a multiple of 'block_length', but it doesn't
have to be a multiple of key_cache_block_size;
+
*/
-uchar *key_cache_read(KEY_CACHE *keycache,
- File file, my_off_t filepos, int level,
- uchar *buff, uint length,
- uint block_length __attribute__((unused)),
- int return_buffer __attribute__((unused)))
+uchar *s_key_cache_read(void *keycache_cb,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length __attribute__((unused)),
+ int return_buffer __attribute__((unused)))
{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
my_bool locked_and_incremented= FALSE;
int error=0;
uchar *start= buff;
- DBUG_ENTER("key_cache_read");
+ DBUG_ENTER("s_key_cache_read");
DBUG_PRINT("enter", ("fd: %u pos: %lu length: %u",
(uint) file, (ulong) filepos, length));
@@ -2536,7 +2821,6 @@ uchar *key_cache_read(KEY_CACHE *keycach
reg1 BLOCK_LINK *block;
uint read_length;
uint offset;
- uint status;
int page_st;
/*
@@ -2570,10 +2854,12 @@ uchar *key_cache_read(KEY_CACHE *keycach
/* Read data in key_cache_block_size increments */
do
{
- /* Cache could be disabled in a later iteration. */
-
+ /* Cache could be disabled in a later iteration. */
if (!keycache->can_be_used)
- goto no_key_cache;
+ {
+ KEYCACHE_DBUG_PRINT("key_cache_read", ("keycache cannot be used"));
+ goto no_key_cache;
+ }
/* Start reading at the beginning of the cache block. */
filepos-= offset;
/* Do not read beyond the end of the cache block. */
@@ -2634,7 +2920,7 @@ uchar *key_cache_read(KEY_CACHE *keycach
}
/* block status may have added BLOCK_ERROR in the above 'if'. */
- if (!((status= block->status) & BLOCK_ERROR))
+ if (!(block->status & BLOCK_ERROR))
{
#ifndef THREAD
if (! return_buffer)
@@ -2660,14 +2946,22 @@ uchar *key_cache_read(KEY_CACHE *keycach
remove_reader(block);
- /*
- Link the block into the LRU ring if it's the last submitted
- request for the block. This enables eviction for the block.
- */
- unreg_request(keycache, block, 1);
+ /* Error injection for coverage testing. */
+ DBUG_EXECUTE_IF("key_cache_read_block_error",
+ block->status|= BLOCK_ERROR;);
- if (status & BLOCK_ERROR)
+ /* Do not link erroneous blocks into the LRU ring, but free them. */
+ if (!(block->status & BLOCK_ERROR))
+ {
+ /*
+ Link the block into the LRU ring if it's the last submitted
+ request for the block. This enables eviction for the block.
+ */
+ unreg_request(keycache, block, 1);
+ }
+ else
{
+ free_block(keycache, block);
error= 1;
break;
}
@@ -2677,7 +2971,7 @@ uchar *key_cache_read(KEY_CACHE *keycach
if (return_buffer)
DBUG_RETURN(block->buffer);
#endif
- next_block:
+ next_block:
buff+= read_length;
filepos+= read_length+offset;
offset= 0;
@@ -2685,6 +2979,7 @@ uchar *key_cache_read(KEY_CACHE *keycach
} while ((length-= read_length));
goto end;
}
+ KEYCACHE_DBUG_PRINT("key_cache_read", ("keycache not initialized"));
no_key_cache:
/* Key cache is not used */
@@ -2705,34 +3000,55 @@ end:
dec_counter_for_resize_op(keycache);
keycache_pthread_mutex_unlock(&keycache->cache_lock);
}
+ DBUG_PRINT("exit", ("error: %d", error ));
DBUG_RETURN(error ? (uchar*) 0 : start);
}
/*
- Insert a block of file data from a buffer into key cache
+ Insert a block of file data from a buffer into a simple key cache
SYNOPSIS
- key_cache_insert()
- keycache pointer to a key cache data structure
+ s_key_cache_insert()
+ keycache_cb pointer to the control block of a simple key cache
file handler for the file to insert data from
filepos position of the block of data in the file to insert
level determines the weight of the data
buff buffer to read data from
length length of the data in the buffer
- NOTES
- This is used by MyISAM to move all blocks from a index file to the key
- cache
-
+ DESCRIPTION
+ This function is the implementation of the key_cache_insert interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key
+ cache.
+ The function writes a block of file data from a buffer into the key cache.
+ The buffer is specified with the parameters buff and length - the pointer
+ to the beginning of the buffer and its size respectively. It's assumed
+ the buffer contains the data from 'file' allocated from the position
+ filepos. The data is copied from the buffer in key_cache_block_size
+ increments.
+ The parameter level is used to set one characteristic for the key buffers
+ loaded with the data from buff. The characteristic is used only by the
+ midpoint insertion strategy.
+
RETURN VALUE
0 if a success, 1 - otherwise.
+
+ NOTES
+ The function is used by MyISAM to move all blocks from an index file to
+ the key cache. It can be performed in parallel with reading the file data
+ from the key buffers by other threads.
+
*/
-int key_cache_insert(KEY_CACHE *keycache,
- File file, my_off_t filepos, int level,
- uchar *buff, uint length)
+static
+int s_key_cache_insert(void *keycache_cb,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length)
{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
int error= 0;
DBUG_ENTER("key_cache_insert");
DBUG_PRINT("enter", ("fd: %u pos: %lu length: %u",
@@ -2916,16 +3232,25 @@ int key_cache_insert(KEY_CACHE *keycache
remove_reader(block);
- /*
- Link the block into the LRU ring if it's the last submitted
- request for the block. This enables eviction for the block.
- */
- unreg_request(keycache, block, 1);
-
- error= (block->status & BLOCK_ERROR);
+ /* Error injection for coverage testing. */
+ DBUG_EXECUTE_IF("key_cache_insert_block_error",
+ block->status|= BLOCK_ERROR; errno=EIO;);
- if (error)
+ /* Do not link erroneous blocks into the LRU ring, but free them. */
+ if (!(block->status & BLOCK_ERROR))
+ {
+ /*
+ Link the block into the LRU ring if it's the last submitted
+ request for the block. This enables eviction for the block.
+ */
+ unreg_request(keycache, block, 1);
+ }
+ else
+ {
+ free_block(keycache, block);
+ error= 1;
break;
+ }
buff+= read_length;
filepos+= read_length+offset;
@@ -2943,43 +3268,65 @@ int key_cache_insert(KEY_CACHE *keycache
/*
- Write a buffer into a cached file.
+ Write a buffer into a simple key cache
SYNOPSIS
- key_cache_write()
- keycache pointer to a key cache data structure
- file handler for the file to write data to
- filepos position in the file to write data to
- level determines the weight of the data
- buff buffer with the data
- length length of the buffer
- dont_write if is 0 then all dirty pages involved in writing
- should have been flushed from key cache
+ s_key_cache_write()
+ keycache_cb pointer to the control block of a simple key cache
+ file handler for the file to write data to
+ file_extra maps of key cache partitions containing
+ dirty pages from file
+ filepos position in the file to write data to
+ level determines the weight of the data
+ buff buffer with the data
+ length length of the buffer
+ dont_write if is 0 then all dirty pages involved in writing
+ should have been flushed from key cache
+ DESCRIPTION
+ This function is the implementation of the key_cache_write interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key
+ cache.
+ In a general case the function copies data from a buffer into the key
+ cache. The buffer is specified with the parameters buff and length -
+ the pointer to the beginning of the buffer and its size respectively.
+ It's assumed the buffer contains the data to be written into 'file'
+ starting from the position filepos. The data is copied from the buffer
+ in key_cache_block_size increments.
+ If the value of the parameter dont_write is FALSE then the function
+ also writes the data into file.
+ The parameter level is used to set one characteristic for the key buffers
+ filled with the data from buff. The characteristic is employed only by
+ the midpoint insertion strategy.
+ The parameter file_extra currently makes sense only for simple key caches
+ that are elements of a partitioned key cache. It provides a pointer to the
+ shared bitmap of the partitions that may contain dirty pages for the file.
+ This bitmap is used to optimize the function p_flush_key_blocks.
+
RETURN VALUE
0 if a success, 1 - otherwise.
- NOTES.
- The function copies the data of size length from buff into buffers
- for key cache blocks that are assigned to contain the portion of
- the file starting with position filepos.
- It ensures that this data is flushed to the file if dont_write is FALSE.
- Filepos must be a multiple of 'block_length', but it doesn't
- have to be a multiple of key_cache_block_size;
+ NOTES
+ This implementation exploits the fact that the function is called only
+ when a thread has got an exclusive lock for the key file.
- dont_write is always TRUE in the server (info->lock_type is never F_UNLCK).
*/
-int key_cache_write(KEY_CACHE *keycache,
- File file, my_off_t filepos, int level,
- uchar *buff, uint length,
- uint block_length __attribute__((unused)),
- int dont_write)
+static
+int s_key_cache_write(void *keycache_cb,
+ File file, void *file_extra __attribute__((unused)),
+ my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length __attribute__((unused)),
+ int dont_write)
{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
my_bool locked_and_incremented= FALSE;
int error=0;
- DBUG_ENTER("key_cache_write");
+ DBUG_ENTER("s_key_cache_write");
DBUG_PRINT("enter",
("fd: %u pos: %lu length: %u block_length: %u"
" key_block_length: %u",
@@ -3206,14 +3553,24 @@ int key_cache_write(KEY_CACHE *keycache,
*/
remove_reader(block);
- /*
- Link the block into the LRU ring if it's the last submitted
- request for the block. This enables eviction for the block.
- */
- unreg_request(keycache, block, 1);
+ /* Error injection for coverage testing. */
+ DBUG_EXECUTE_IF("key_cache_write_block_error",
+ block->status|= BLOCK_ERROR;);
- if (block->status & BLOCK_ERROR)
+ /* Do not link erroneous blocks into the LRU ring, but free them. */
+ if (!(block->status & BLOCK_ERROR))
+ {
+ /*
+ Link the block into the LRU ring if it's the last submitted
+ request for the block. This enables eviction for the block.
+ */
+ unreg_request(keycache, block, 1);
+ }
+ else
{
+ /* Pretend a "clean" block to avoid complications. */
+ block->status&= ~(BLOCK_CHANGED);
+ free_block(keycache, block);
error= 1;
break;
}
@@ -3284,12 +3641,13 @@ end:
Block must have a request registered on it.
*/
-static void free_block(KEY_CACHE *keycache, BLOCK_LINK *block)
+static void free_block(S_KEY_CACHE_CB *keycache, BLOCK_LINK *block)
{
KEYCACHE_THREAD_TRACE("free block");
KEYCACHE_DBUG_PRINT("free_block",
- ("block %u to be freed, hash_link %p",
- BLOCK_NUMBER(block), block->hash_link));
+ ("block %u to be freed, hash_link %p status: %u",
+ BLOCK_NUMBER(block), block->hash_link,
+ block->status));
/*
Assert that the block is not free already. And that it is in a clean
state. Note that the block might just be assigned to a hash_link and
@@ -3371,10 +3729,14 @@ static void free_block(KEY_CACHE *keycac
if (block->status & BLOCK_IN_EVICTION)
return;
- /* Here the block must be in the LRU ring. Unlink it again. */
- DBUG_ASSERT(block->next_used && block->prev_used &&
- *block->prev_used == block);
- unlink_block(keycache, block);
+ /* Error blocks are not put into the LRU ring. */
+ if (!(block->status & BLOCK_ERROR))
+ {
+ /* Here the block must be in the LRU ring. Unlink it again. */
+ DBUG_ASSERT(block->next_used && block->prev_used &&
+ *block->prev_used == block);
+ unlink_block(keycache, block);
+ }
if (block->temperature == BLOCK_WARM)
keycache->warm_blocks--;
block->temperature= BLOCK_COLD;
@@ -3419,7 +3781,7 @@ static int cmp_sec_link(BLOCK_LINK **a,
free used blocks if requested
*/
-static int flush_cached_blocks(KEY_CACHE *keycache,
+static int flush_cached_blocks(S_KEY_CACHE_CB *keycache,
File file, BLOCK_LINK **cache,
BLOCK_LINK **end,
enum flush_type type)
@@ -3463,10 +3825,9 @@ static int flush_cached_blocks(KEY_CACHE
(BLOCK_READ | BLOCK_IN_FLUSH | BLOCK_CHANGED | BLOCK_IN_USE));
block->status|= BLOCK_IN_FLUSHWRITE;
keycache_pthread_mutex_unlock(&keycache->cache_lock);
- error= my_pwrite(file,
- block->buffer+block->offset,
+ error= my_pwrite(file, block->buffer + block->offset,
block->length - block->offset,
- block->hash_link->diskpos+ block->offset,
+ block->hash_link->diskpos + block->offset,
MYF(MY_NABP | MY_WAIT_IF_FULL));
keycache_pthread_mutex_lock(&keycache->cache_lock);
keycache->global_cache_write++;
@@ -3527,7 +3888,7 @@ static int flush_cached_blocks(KEY_CACHE
/*
- flush all key blocks for a file to disk, but don't do any mutex locks.
+ Flush all key blocks for a file to disk, but don't do any mutex locks
SYNOPSIS
flush_key_blocks_int()
@@ -3549,7 +3910,7 @@ static int flush_cached_blocks(KEY_CACHE
1 error
*/
-static int flush_key_blocks_int(KEY_CACHE *keycache,
+static int flush_key_blocks_int(S_KEY_CACHE_CB *keycache,
File file, enum flush_type type)
{
BLOCK_LINK *cache_buff[FLUSH_CACHE],**cache;
@@ -3986,23 +4347,49 @@ err:
/*
- Flush all blocks for a file to disk
+ Flush all blocks for a file from key buffers of a simple key cache
SYNOPSIS
- flush_key_blocks()
- keycache pointer to a key cache data structure
- file handler for the file to flush to
- flush_type type of the flush
+ s_flush_key_blocks()
+ keycache_cb pointer to the control block of a simple key cache
+ file handler for the file to flush to
+ file_extra maps of key cache partitions containing
+ dirty pages from file (not used)
+ flush_type type of the flush operation
+ DESCRIPTION
+ This function is the implementation of the flush_key_blocks interface
+ function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key
+ cache.
+ In a general case the function flushes the data from all dirty key
+ buffers related to the file 'file' into this file. The function does
+ exactly this if the value of the parameter type is FLUSH_KEEP. If the
+ value of this parameter is FLUSH_RELEASE, the function additionally
+ releases the key buffers containing data from 'file' for new usage.
+ If the value of the parameter type is FLUSH_IGNORE_CHANGED the function
+ just releases the key buffers containing data from 'file'.
+ The parameter file_extra currently is not used by this function.
+
RETURN
0 ok
1 error
+
+ NOTES
+ This implementation exploits the fact that the function is called only
+ when a thread has got an exclusive lock for the key file.
+
*/
-int flush_key_blocks(KEY_CACHE *keycache,
- File file, enum flush_type type)
+static
+int s_flush_key_blocks(void *keycache_cb,
+ File file,
+ void *file_extra __attribute__((unused)),
+ enum flush_type type)
{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
int res= 0;
DBUG_ENTER("flush_key_blocks");
DBUG_PRINT("enter", ("keycache: 0x%lx", (long) keycache));
@@ -4055,7 +4442,7 @@ int flush_key_blocks(KEY_CACHE *keycache
!= 0 Error
*/
-static int flush_all_key_blocks(KEY_CACHE *keycache)
+static int flush_all_key_blocks(S_KEY_CACHE_CB *keycache)
{
BLOCK_LINK *block;
uint total_found;
@@ -4158,37 +4545,45 @@ static int flush_all_key_blocks(KEY_CACH
/*
- Reset the counters of a key cache.
+ Reset the counters of a simple key cache
SYNOPSIS
- reset_key_cache_counters()
- name the name of a key cache
- key_cache pointer to the key kache to be reset
+ s_reset_key_cache_counters()
+ name the name of a key cache
+ keycache_cb pointer to the control block of a simple key cache
DESCRIPTION
- This procedure is used by process_key_caches() to reset the counters of all
- currently used key caches, both the default one and the named ones.
+ This function is the implementation of the reset_key_cache_counters
+ interface function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key cache.
+ This function resets the values of all statistical counters for the key
+ cache to 0.
+ The parameter name is currently not used.
RETURN
0 on success (always because it can't fail)
+
*/
-int reset_key_cache_counters(const char *name __attribute__((unused)),
- KEY_CACHE *key_cache)
+static
+int s_reset_key_cache_counters(const char *name __attribute__((unused)),
+ void *keycache_cb)
{
- DBUG_ENTER("reset_key_cache_counters");
- if (!key_cache->key_cache_inited)
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ DBUG_ENTER("s_reset_key_cache_counters");
+ if (!keycache->key_cache_inited)
{
DBUG_PRINT("info", ("Key cache %s not initialized.", name));
DBUG_RETURN(0);
}
DBUG_PRINT("info", ("Resetting counters for key cache %s.", name));
- key_cache->global_blocks_changed= 0; /* Key_blocks_not_flushed */
- key_cache->global_cache_r_requests= 0; /* Key_read_requests */
- key_cache->global_cache_read= 0; /* Key_reads */
- key_cache->global_cache_w_requests= 0; /* Key_write_requests */
- key_cache->global_cache_write= 0; /* Key_writes */
+ keycache->global_blocks_changed= 0; /* Key_blocks_not_flushed */
+ keycache->global_cache_r_requests= 0; /* Key_read_requests */
+ keycache->global_cache_read= 0; /* Key_reads */
+ keycache->global_cache_w_requests= 0; /* Key_write_requests */
+ keycache->global_cache_write= 0; /* Key_writes */
DBUG_RETURN(0);
}
@@ -4197,7 +4592,7 @@ int reset_key_cache_counters(const char
/*
Test if disk-cache is ok
*/
-static void test_key_cache(KEY_CACHE *keycache __attribute__((unused)),
+static void test_key_cache(S_KEY_CACHE_CB *keycache __attribute__((unused)),
const char *where __attribute__((unused)),
my_bool lock __attribute__((unused)))
{
@@ -4211,7 +4606,7 @@ static void test_key_cache(KEY_CACHE *ke
#define MAX_QUEUE_LEN 100
-static void keycache_dump(KEY_CACHE *keycache)
+static void keycache_dump(S_KEY_CACHE_CB *keycache)
{
FILE *keycache_dump_file=fopen(KEYCACHE_DUMP_FILE, "w");
struct st_my_thread_var *last;
@@ -4404,8 +4799,8 @@ static void keycache_debug_print(const c
va_start(args,fmt);
if (keycache_debug_log)
{
- VOID(vfprintf(keycache_debug_log, fmt, args));
- VOID(fputc('\n',keycache_debug_log));
+ (void) vfprintf(keycache_debug_log, fmt, args);
+ (void) fputc('\n', keycache_debug_log);
}
va_end(args);
}
@@ -4451,7 +4846,7 @@ static int fail_hlink(HASH_LINK *hlink)
return 0; /* Let the assert fail. */
}
-static int cache_empty(KEY_CACHE *keycache)
+static int cache_empty(S_KEY_CACHE_CB *keycache)
{
int errcnt= 0;
int idx;
@@ -4489,3 +4884,1675 @@ static int cache_empty(KEY_CACHE *keycac
}
#endif
+
+/*
+ Get statistics for a simple key cache
+
+ SYNOPSIS
+ get_key_cache_statistics()
+ keycache_cb pointer to the control block of a simple key cache
+ partition_no partition number (not used)
+ key_cache_stats OUT pointer to the structure for the returned statistics
+
+ DESCRIPTION
+ This function is the implementation of the get_key_cache_statistics
+ interface function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key cache.
+ This function returns the statistical data for the key cache.
+ The parameter partition_no is not used by this function.
+
+ RETURN
+ none
+
+*/
+
+static
+void s_get_key_cache_statistics(void *keycache_cb,
+ uint partition_no __attribute__((unused)),
+ KEY_CACHE_STATISTICS *key_cache_stats)
+{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ DBUG_ENTER("s_get_key_cache_statistics");
+
+ key_cache_stats->mem_size= (longlong) keycache->key_cache_mem_size;
+ key_cache_stats->block_size= (longlong) keycache->key_cache_block_size;
+ key_cache_stats->blocks_used= keycache->blocks_used;
+ key_cache_stats->blocks_unused= keycache->blocks_unused;
+ key_cache_stats->blocks_changed= keycache->global_blocks_changed;
+ key_cache_stats->read_requests= keycache->global_cache_r_requests;
+ key_cache_stats->reads= keycache->global_cache_read;
+ key_cache_stats->write_requests= keycache->global_cache_w_requests;
+ key_cache_stats->writes= keycache->global_cache_write;
+ DBUG_VOID_RETURN;
+}
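
A partitioned key cache reports the same fields either per partition or aggregated over all partitions. The following standalone sketch shows such an aggregation with a hypothetical stats_t that mirrors only the fields filled in above (the real structure is KEY_CACHE_STATISTICS, declared in the key cache header):

  #include <stdio.h>

  /* Hypothetical mirror of the fields filled in above; not the real
     KEY_CACHE_STATISTICS declaration. */
  typedef struct {
    long long mem_size;
    long long block_size;
    unsigned long long blocks_used;
    unsigned long long blocks_unused;
    unsigned long long blocks_changed;
    unsigned long long read_requests;
    unsigned long long reads;
    unsigned long long write_requests;
    unsigned long long writes;
  } stats_t;

  /* Aggregate per-partition statistics the way a partitioned cache
     would report them for the whole cache. */
  static void aggregate(stats_t *total, const stats_t *part, unsigned n)
  {
    unsigned i;
    for (i= 0; i < n; i++)
    {
      total->mem_size+=       part[i].mem_size;
      total->blocks_used+=    part[i].blocks_used;
      total->blocks_unused+=  part[i].blocks_unused;
      total->blocks_changed+= part[i].blocks_changed;
      total->read_requests+=  part[i].read_requests;
      total->reads+=          part[i].reads;
      total->write_requests+= part[i].write_requests;
      total->writes+=         part[i].writes;
      total->block_size=      part[i].block_size; /* same in all partitions */
    }
  }

  int main(void)
  {
    stats_t parts[2]= {{ 1024, 512, 2, 0, 1, 10, 3, 5, 2 },
                       { 1024, 512, 1, 1, 0,  7, 2, 4, 1 }};
    stats_t total= { 0, 0, 0, 0, 0, 0, 0, 0, 0 };
    aggregate(&total, parts, 2);
    printf("reads: %llu of %llu requests\n", total.reads, total.read_requests);
    return 0;
  }
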
+
+
+static size_t s_key_cache_stat_var_offsets[]=
+{
+ offsetof(S_KEY_CACHE_CB, blocks_used),
+ offsetof(S_KEY_CACHE_CB, blocks_unused),
+ offsetof(S_KEY_CACHE_CB, global_blocks_changed),
+ offsetof(S_KEY_CACHE_CB, global_cache_w_requests),
+ offsetof(S_KEY_CACHE_CB, global_cache_write),
+ offsetof(S_KEY_CACHE_CB, global_cache_r_requests),
+ offsetof(S_KEY_CACHE_CB, global_cache_read)
+};
+
+
+/*
+ Get the value of a statistical variable for a simple key cache
+
+ SYNOPSIS
+ s_get_key_cache_stat_value()
+ keycache_cb pointer to the control block of a simple key cache
+ var_no the ordered number of a statistical variable
+
+ DESCRIPTION
+ This function is the implementation of the get_key_cache_stat_value
+ interface function that is employed by simple (non-partitioned) key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type S_KEY_CACHE_CB for a simple key cache.
+ This function returns the value of the statistical variable var_no
+ for this key cache. The variables are numbered starting from 0 to 6.
+
+ RETURN
+ The value of the specified statistical variable
+
+*/
+
+static
+ulonglong s_get_key_cache_stat_value(void *keycache_cb, uint var_no)
+{
+ S_KEY_CACHE_CB *keycache= (S_KEY_CACHE_CB *) keycache_cb;
+ size_t var_ofs= s_key_cache_stat_var_offsets[var_no];
+ ulonglong res= 0;
+ DBUG_ENTER("s_get_key_cache_stat_value");
+
+ if (var_no < 3)
+ res= (ulonglong) (*(long *) ((char *) keycache + var_ofs));
+ else
+ res= *(ulonglong *) ((char *) keycache + var_ofs);
+
+ DBUG_RETURN(res);
+}
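
The offsetof technique used above can be shown in isolation. The struct below is a simplified stand-in for S_KEY_CACHE_CB that keeps only the seven counters and their long / ulonglong split:

  #include <stdio.h>
  #include <stddef.h>

  /* Simplified control block: the first three counters are long, the
     rest are unsigned long long, mirroring the var_no < 3 split above. */
  typedef struct {
    long blocks_used;
    long blocks_unused;
    long blocks_changed;
    unsigned long long cache_w_requests;
    unsigned long long cache_write;
    unsigned long long cache_r_requests;
    unsigned long long cache_read;
  } cb_t;

  static const size_t stat_var_offsets[]=
  {
    offsetof(cb_t, blocks_used),
    offsetof(cb_t, blocks_unused),
    offsetof(cb_t, blocks_changed),
    offsetof(cb_t, cache_w_requests),
    offsetof(cb_t, cache_write),
    offsetof(cb_t, cache_r_requests),
    offsetof(cb_t, cache_read)
  };

  /* Same technique as s_get_key_cache_stat_value: pick the field by its
     byte offset and cast according to its position in the table. */
  static unsigned long long get_stat_value(const cb_t *cb, unsigned var_no)
  {
    size_t ofs= stat_var_offsets[var_no];
    if (var_no < 3)
      return (unsigned long long) *(const long *) ((const char *) cb + ofs);
    return *(const unsigned long long *) ((const char *) cb + ofs);
  }

  int main(void)
  {
    cb_t cb= { 8, 2, 1, 100, 40, 300, 25 };
    printf("blocks_used=%llu reads=%llu\n",
           get_stat_value(&cb, 0), get_stat_value(&cb, 6));
    return 0;
  }
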
+
+
+/*
+ The array of pointers to the key cache interface functions used for simple
+ key caches. Any simple key cache object, including those incorporated into
+ partitioned key caches, uses this array.
+
+ The current implementation of these functions allows them to be called
+ from the MySQL server code directly, though we don't do that.
+*/
+
+static KEY_CACHE_FUNCS s_key_cache_funcs =
+{
+ s_init_key_cache,
+ s_resize_key_cache,
+ s_change_key_cache_param,
+ s_key_cache_read,
+ s_key_cache_insert,
+ s_key_cache_write,
+ s_flush_key_blocks,
+ s_reset_key_cache_counters,
+ s_end_key_cache,
+ s_get_key_cache_statistics,
+ s_get_key_cache_stat_value
+};
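
How a caller dispatches through such a table can be sketched independently of the key cache itself. The names cache_funcs_t, cache_handle_t and the two dummy operations below are hypothetical, not the real KEY_CACHE_FUNCS members:

  #include <stdio.h>
  #include <stddef.h>

  /* Hypothetical, trimmed-down interface table: just two of the
     operations, enough to show the dispatch pattern. */
  typedef struct {
    int  (*init)(void *cb, unsigned block_size, size_t use_mem);
    void (*end)(void *cb, int cleanup);
  } cache_funcs_t;

  typedef struct {
    void                *cb;     /* control block: simple or partitioned */
    const cache_funcs_t *funcs;  /* operations for that kind of cache    */
  } cache_handle_t;

  /* Dummy "simple cache" operations standing in for
     s_init_key_cache / s_end_key_cache. */
  static int  simple_init(void *cb, unsigned bs, size_t mem)
  { (void) cb; return (int) (mem / bs); }
  static void simple_end(void *cb, int cleanup)
  { (void) cb; (void) cleanup; }

  static const cache_funcs_t simple_funcs= { simple_init, simple_end };

  int main(void)
  {
    cache_handle_t h= { NULL, &simple_funcs };
    /* The caller never needs to know which implementation it talks to. */
    printf("blocks: %d\n", h.funcs->init(h.cb, 1024, 1024 * 1024));
    h.funcs->end(h.cb, 1);
    return 0;
  }
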
+
+
+/******************************************************************************
+ Partitioned Key Cache Module
+
+ The module contains implementations of all key cache interface functions
+ employed by partitioned key caches.
+
+ A partitioned key cache is a collection of structures for simple key caches
+ called key cache partitions. Any page from a file can be placed into a buffer
+ of only one partition. The number of the partition is calculated from
+ the file number and the position of the page in the file, and it's always the
+ same for the page. The function that maps pages into partitions takes care
+ of even distribution of pages among partitions.
+
+ Partitioned key caches mitigate one of the major problems of a simple key
+ cache: thread contention for the key cache lock (mutex). Every call of a
+ key cache interface function must acquire this lock, so threads compete
+ for it even when they have acquired shared locks for the file and the
+ pages they want to read are already in the key cache buffers.
+ When working with a partitioned key cache, a key cache interface function
+ that needs only one page has to acquire the key cache lock only for the
+ partition the page is ascribed to. This reduces the chances that threads
+ compete for the same key cache lock. Unfortunately, if we use a
+ partitioned key cache with N partitions for B-tree indexes, we cannot say
+ that the chances become N times smaller. The reason is that any index
+ lookup operation requires reading the root page which, for any index, is
+ always ascribed to the same partition. To resolve this problem we would
+ have to employ more sophisticated mechanisms for working with root pages.
+
+ Currently the number of partitions in a partitioned key cache is limited
+ to 64. We could increase this limit, but then we would also have to
+ increase accordingly the size of the bitmap dirty_part_map in the
+ MYISAM_SHARE structure.
+
+******************************************************************************/
+
+/* Control block for a partitioned key cache */
+
+typedef struct st_p_key_cache_cb
+{
+ my_bool key_cache_inited; /*<=> control block is allocated */
+ S_KEY_CACHE_CB **partition_array; /* array of the key cache partitions */
+ uint partitions; /* number of partitions in the key cache */
+ size_t key_cache_mem_size; /* specified size of the cache memory */
+ uint key_cache_block_size; /* size of the page buffer of a cache block */
+} P_KEY_CACHE_CB;
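
The two-level layout (one control block that owns an array of per-partition control blocks) can be sketched with plain malloc/calloc in place of my_malloc. The trimmed-down control blocks below are illustrative only:

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical, trimmed-down versions of the control blocks. */
  typedef struct { int blocks; } simple_cb_t;

  typedef struct {
    int           inited;
    simple_cb_t **partition_array;
    unsigned      partitions;
  } part_cb_t;

  static int alloc_partitions(part_cb_t *kc, unsigned partitions)
  {
    unsigned i;
    kc->partition_array= malloc(sizeof(simple_cb_t *) * partitions);
    if (!kc->partition_array)
      return 1;
    for (i= 0; i < partitions; i++)
    {
      if (!(kc->partition_array[i]= calloc(1, sizeof(simple_cb_t))))
        return 1;        /* the real code instead drops the failed partition */
    }
    kc->partitions= partitions;
    kc->inited= 1;
    return 0;
  }

  int main(void)
  {
    unsigned i;
    part_cb_t kc= { 0, NULL, 0 };
    if (alloc_partitions(&kc, 4))
      return 1;
    printf("partitions allocated: %u\n", kc.partitions);
    for (i= 0; i < kc.partitions; i++)
      free(kc.partition_array[i]);
    free(kc.partition_array);
    return 0;
  }
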
+
+static
+void p_end_key_cache(void *keycache_cb, my_bool cleanup);
+
+/*
+ Determine the partition to which the index block to read is ascribed
+
+ SYNOPSIS
+ get_key_cache_partition()
+ keycache pointer to the control block of a partitioned key cache
+ file handler for the file for the block of data to be read
+ filepos position of the block of data in the file
+
+ DESCRIPTION
+ The function determines the number of the partition in whose buffer the
+ block from 'file' at the position filepos has to be placed for reading.
+ The function returns the control block of the simple key cache for this
+ partition to the caller.
+
+ RETURN VALUE
+ The pointer to the control block of the partition to which the specified
+ file block is ascribed.
+*/
+
+static
+S_KEY_CACHE_CB *get_key_cache_partition(P_KEY_CACHE_CB *keycache,
+ File file, my_off_t filepos)
+{
+ uint i= KEYCACHE_BASE_EXPR( file, filepos) % keycache->partitions;
+ return keycache->partition_array[i];
+}
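
A standalone sketch of this mapping, assuming a base expression of the usual "file number plus block number" form (the real KEYCACHE_BASE_EXPR macro is defined earlier in this file and may differ):

  #include <stdio.h>

  /* Assumed form of the base expression: block number within the file
     plus the file descriptor.  The real KEYCACHE_BASE_EXPR may differ. */
  static unsigned base_expr(int file, unsigned long long filepos,
                            unsigned block_size)
  {
    return (unsigned) (filepos / block_size) + (unsigned) file;
  }

  static unsigned partition_no(int file, unsigned long long filepos,
                               unsigned block_size, unsigned partitions)
  {
    return base_expr(file, filepos, block_size) % partitions;
  }

  int main(void)
  {
    unsigned block_size= 1024, partitions= 4;
    unsigned long long pos;
    /* Consecutive blocks of the same file land in different partitions. */
    for (pos= 0; pos < 8 * 1024; pos+= block_size)
      printf("file 7, pos %llu -> partition %u\n",
             pos, partition_no(7, pos, block_size, partitions));
    return 0;
  }
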
+
+
+/*
+ Determine the partition to which the index block to write is ascribed
+
+ SYNOPSIS
+ get_key_cache_partition_for_write()
+ keycache pointer to the control block of a partitioned key cache
+ file handler for the file for the block of data to be read
+ filepos position of the block of data in the file
+ dirty_part_map pointer to the bitmap of dirty partitions for the file
+
+ DESCRIPTION
+ The function determines the number of the partition in whose buffer the
+ block from 'file' at the position filepos has to be placed for writing and
+ marks the partition as dirty in the dirty_part_map bitmap.
+ The function returns the control block of the simple key cache for this
+ partition to the caller.
+
+ RETURN VALUE
+ The pointer to the control block of the partition to which the specified
+ file block is ascribed.
+*/
+
+static
+S_KEY_CACHE_CB *get_key_cache_partition_for_write(P_KEY_CACHE_CB *keycache,
+ File file, my_off_t filepos,
+ ulonglong* dirty_part_map)
+{
+ uint i= KEYCACHE_BASE_EXPR( file, filepos) % keycache->partitions;
+ *dirty_part_map|= ((ulonglong) 1) << i;
+ return keycache->partition_array[i];
+}
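
Marking a partition dirty is a single bit set in a 64-bit map. Since up to 64 partitions are allowed, the shifted constant has to be 64 bits wide, as in this standalone sketch:

  #include <stdio.h>

  typedef unsigned long long dirty_map_t;   /* one bit per partition */

  /* Mark partition i as possibly containing dirty pages for a file.
     1ULL keeps the shift well-defined for i >= 32 (up to 64 partitions). */
  static void mark_dirty(dirty_map_t *map, unsigned i)
  {
    *map|= 1ULL << i;
  }

  static int is_dirty(dirty_map_t map, unsigned i)
  {
    return (map >> i) & 1ULL;
  }

  int main(void)
  {
    dirty_map_t map= 0;
    mark_dirty(&map, 3);
    mark_dirty(&map, 40);
    printf("partition 3 dirty: %d, partition 40 dirty: %d, map=%#llx\n",
           is_dirty(map, 3), is_dirty(map, 40), map);
    return 0;
  }
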
+
+
+/*
+ Initialize a partitioned key cache
+
+ SYNOPSIS
+ p_init_key_cache()
+ keycache_cb pointer to the control block of a partitioned key cache
+ key_cache_block_size size of blocks to keep cached data
+ use_mem total memory to use for all key cache partitions
+ division_limit division limit (may be zero)
+ age_threshold age threshold (may be zero)
+
+ DESCRIPTION
+ This function is the implementation of the init_key_cache interface function
+ that is employed by partitioned key caches.
+ The function builds and initializes an array of simple key caches, and then
+ initializes the control block structure of the type P_KEY_CACHE_CB that is
+ used for a partitioned key cache. The parameter keycache_cb is supposed to
+ point to this structure. The number of partitions in the partitioned key
+ cache to be built must be passed through the field 'partitions' of this
+ structure. The parameter key_cache_block_size specifies the size of the
+ blocks in the the simple key caches to be built. The parameters
+ division_limit and age_threshold determine the initial values of those
+ characteristics of the simple key caches that are used for midpoint
+ insertion strategy. The parameter use_mem specifies the total amount of
+ memory to be allocated for the key cache blocks in all simple key caches
+ and for all auxiliary structures.
+
+ RETURN VALUE
+ total number of blocks in key cache partitions, if successful,
+ <= 0 - otherwise.
+
+ NOTES
+ If keycache->key_cache_inited != 0 then we assume that the memory for
+ the array of partitions has already been allocated.
+
+ It's assumed that no two threads call this function simultaneously
+ referring to the same key cache handle.
+*/
+
+static
+int p_init_key_cache(void *keycache_cb, uint key_cache_block_size,
+ size_t use_mem, uint division_limit,
+ uint age_threshold)
+{
+ int i;
+ size_t mem_per_cache;
+ int cnt;
+ S_KEY_CACHE_CB *partition;
+ S_KEY_CACHE_CB **partition_ptr;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ int blocks= -1;
+ DBUG_ENTER("p_init_key_cache");
+
+ keycache->key_cache_block_size = key_cache_block_size;
+
+ if (keycache->key_cache_inited)
+ partition_ptr= keycache->partition_array;
+ else
+ {
+ if(!(partition_ptr=
+ (S_KEY_CACHE_CB **) my_malloc(sizeof(S_KEY_CACHE_CB *) * partitions,
+ MYF(0))))
+ DBUG_RETURN(blocks);
+ keycache->partition_array= partition_ptr;
+ }
+
+ mem_per_cache = use_mem / partitions;
+
+ for (i= 0; i < (int) partitions; i++)
+ {
+ my_bool key_cache_inited= keycache->key_cache_inited;
+ if (key_cache_inited)
+ partition= *partition_ptr;
+ else
+ {
+ if (!(partition= (S_KEY_CACHE_CB *) my_malloc(sizeof(S_KEY_CACHE_CB),
+ MYF(0))))
+ continue;
+ partition->key_cache_inited= 0;
+ }
+
+ if ((cnt= s_init_key_cache(partition,
+ key_cache_block_size, mem_per_cache,
+ division_limit, age_threshold)) <= 0)
+ {
+ s_end_key_cache(partition, 1);
+ my_free((uchar *) partition, MYF(0));
+ partition= 0;
+ if (key_cache_inited)
+ {
+ memmove(partition_ptr, partition_ptr+1,
+ sizeof(partition_ptr)*(partitions-i-1));
+ }
+ if (i == 0)
+ {
+ i--;
+ partitions--;
+ if (partitions)
+ mem_per_cache = use_mem / partitions;
+ }
+ continue;
+ }
+
+ if (blocks < 0)
+ blocks= 0;
+ blocks+= cnt;
+ *partition_ptr++= partition;
+ }
+
+ keycache->partitions= partitions= partition_ptr-keycache->partition_array;
+ keycache->key_cache_mem_size= mem_per_cache * partitions;
+ for (i= 0; i < (int) partitions; i++)
+ keycache->partition_array[i]->hash_factor= partitions;
+
+ keycache->key_cache_inited= 1;
+
+ DBUG_RETURN(blocks);
+}
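
The even split of use_mem over the partitions, and the fact that key_cache_mem_size is recomputed from what was actually handed to the partitions, amounts to the following arithmetic (a standalone sketch):

  #include <stdio.h>
  #include <stddef.h>

  int main(void)
  {
    size_t use_mem= 64u * 1024 * 1024;        /* e.g. a 64M key buffer */
    unsigned partitions= 8;
    size_t mem_per_cache= use_mem / partitions;

    /* The reported cache size reflects rounding losses, because it is
       rebuilt from mem_per_cache rather than taken from use_mem. */
    size_t key_cache_mem_size= mem_per_cache * partitions;

    printf("per partition: %zu bytes, total: %zu bytes\n",
           mem_per_cache, key_cache_mem_size);
    return 0;
  }
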
+
+
+/*
+ Resize a partitioned key cache
+
+ SYNOPSIS
+ p_resize_key_cache()
+ keycache_cb pointer to the control block of a partitioned key cache
+ key_cache_block_size size of blocks to keep cached data
+ use_mem total memory to use for the new key cache
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+
+ DESCRIPTION
+ This function is the implementation of the resize_key_cache interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for the partitioned
+ key cache to be resized.
+ The parameter key_cache_block_size specifies the new size of the blocks in
+ the simple key caches that comprise the partitioned key cache.
+ The parameters division_limit and age_threshold determine the new initial
+ values of those characteristics of the simple key cache that are used for
+ midpoint insertion strategy. The parameter use_mem specifies the total
+ amount of memory to be allocated for the key cache blocks in all new
+ simple key caches and for all auxiliary structures.
+
+ RETURN VALUE
+ number of blocks in the key cache, if successful,
+ 0 - otherwise.
+
+ NOTES.
+ The function first calls s_prepare_resize_key_cache for each simple
+ key cache, effectively flushing all dirty pages from it and destroying
+ the key cache. Then p_init_key_cache is called. This call builds a new
+ array of simple key caches containing the same number of elements as
+ the old one. After this the function calls s_finish_resize_key_cache
+ for each simple key cache from this array.
+
+ This implementation doesn't block the calls and executions of other
+ functions from the key cache interface. However it assumes that the
+ calls of p_resize_key_cache itself are serialized.
+
+*/
+
+static
+int p_resize_key_cache(void *keycache_cb, uint key_cache_block_size,
+ size_t use_mem, uint division_limit,
+ uint age_threshold)
+{
+ uint i;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ my_bool cleanup= use_mem == 0;
+ int blocks= -1;
+ int err= 0;
+ DBUG_ENTER("p_resize_key_cache");
+ if (use_mem == 0)
+ {
+ p_end_key_cache(keycache_cb, 0);
+ DBUG_RETURN(blocks);
+ }
+ for (i= 0; i < partitions; i++)
+ {
+ err|= s_prepare_resize_key_cache(keycache->partition_array[i], 0, 1);
+ }
+ if (!err && use_mem)
+ blocks= p_init_key_cache(keycache_cb, key_cache_block_size, use_mem,
+ division_limit, age_threshold);
+ if (blocks > 0 && !cleanup)
+ {
+ for (i= 0; i < partitions; i++)
+ {
+ s_finish_resize_key_cache(keycache->partition_array[i], 0, 1);
+ }
+ }
+ DBUG_RETURN(blocks);
+}
+
+
+/*
+ Change key cache parameters of a partitioned key cache
+
+ SYNOPSIS
+ p_change_key_cache_param()
+ keycache_cb pointer to the control block of a partitioned key cache
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+
+ DESCRIPTION
+ This function is the implementation of the change_key_cache_param interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for the partitioned
+ key cache where new values of the division limit and the age threshold used
+ for midpoint insertion strategy are to be set. The parameters
+ division_limit and age_threshold provide these new values.
+
+ RETURN VALUE
+ none
+
+ NOTES
+ The function just calls s_change_key_cache_param for each element from the
+ array of simple caches that comprise the partitioned key cache.
+
+*/
+
+static
+void p_change_key_cache_param(void *keycache_cb, uint division_limit,
+ uint age_threshold)
+{
+ uint i;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ DBUG_ENTER("p_change_key_cache_param");
+ for (i= 0; i < partitions; i++)
+ {
+ s_change_key_cache_param(keycache->partition_array[i], division_limit,
+ age_threshold);
+ }
+ DBUG_VOID_RETURN;
+}
+
+
+/*
+ Destroy a partitioned key cache
+
+ SYNOPSIS
+ p_end_key_cache()
+ keycache_cb pointer to the control block of a partitioned key cache
+ cleanup <=> complete free (free also control block structures
+ for all simple key caches)
+
+ DESCRIPTION
+ This function is the implementation of the end_key_cache interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for the partitioned
+ key cache to be destroyed.
+ The function frees the memory allocated for the cache blocks and
+ auxiliary structures used by simple key caches that comprise the
+ partitioned key cache. If the value of the parameter cleanup is TRUE
+ then even the memory used for control blocks of the simple key caches
+ and the array of pointers to them are freed.
+
+ RETURN VALUE
+ none
+
+*/
+
+static
+void p_end_key_cache(void *keycache_cb, my_bool cleanup)
+{
+ uint i;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ DBUG_ENTER("p_end_key_cache");
+ DBUG_PRINT("enter", ("key_cache: 0x%lx", (long) keycache));
+
+ for (i= 0; i < partitions; i++)
+ {
+ s_end_key_cache(keycache->partition_array[i], cleanup);
+ }
+ if (cleanup)
+ {
+ for (i= 0; i < partitions; i++)
+ my_free((uchar*) keycache->partition_array[i], MYF(0));
+ my_free((uchar*) keycache->partition_array, MYF(0));
+ keycache->key_cache_inited= 0;
+ }
+ DBUG_VOID_RETURN;
+}
+
+
+/*
+ Read a block of data from a partitioned key cache into a buffer
+
+ SYNOPSIS
+
+ p_key_cache_read()
+ keycache_cb pointer to the control block of a partitioned key cache
+ file handler for the file for the block of data to be read
+ filepos position of the block of data in the file
+ level determines the weight of the data
+ buff buffer to where the data must be placed
+ length length of the buffer
+ block_length length of the read data from a key cache block
+ return_buffer return pointer to the key cache buffer with the data
+
+ DESCRIPTION
+ This function is the implementation of the key_cache_read interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned
+ key cache.
+ In a general case the function reads a block of data from the key cache
+ into the buffer buff of the size specified by the parameter length. The
+ beginning of the block of data to be read is specified by the parameters
+ file and filepos. The length of the read data is the same as the length
+ of the buffer. The data is read into the buffer in key_cache_block_size
+ increments. To read each portion the function first finds out in what
+ partition of the key cache this portion (page) is to be saved, and calls
+ s_key_cache_read with the pointer to the corresponding simple key cache
+ as its first parameter.
+ If the parameter return_buffer is not ignored and its value is TRUE, and
+ the data to be read of the specified size block_length can be read from one
+ key cache buffer, then the function returns a pointer to the data in the
+ key cache buffer.
+ The function takes into account the parameters block_length and return_buffer
+ only in a single-threaded environment.
+ The parameter 'level' is used only by the midpoint insertion strategy
+ when the data or its portion cannot be found in the key cache.
+
+ RETURN VALUE
+ Returns address from where the data is placed if successful, 0 - otherwise.
+
+*/
+
+static
+uchar *p_key_cache_read(void *keycache_cb,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length __attribute__((unused)),
+ int return_buffer __attribute__((unused)))
+{
+ uint r_length;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint offset= (uint) (filepos % keycache->key_cache_block_size);
+ uchar *start= buff;
+ DBUG_ENTER("p_key_cache_read");
+ DBUG_PRINT("enter", ("fd: %u pos: %lu length: %u",
+ (uint) file, (ulong) filepos, length));
+
+#ifndef THREAD
+ if (block_length > keycache->key_cache_block_size || offset)
+ return_buffer=0;
+#endif
+
+ /* Read data in key_cache_block_size increments */
+ do
+ {
+ S_KEY_CACHE_CB *partition= get_key_cache_partition(keycache,
+ file, filepos);
+ uchar *ret_buff= 0;
+ r_length= length;
+ set_if_smaller(r_length, keycache->key_cache_block_size - offset);
+ ret_buff= s_key_cache_read((void *) partition,
+ file, filepos, level,
+ buff, r_length,
+ block_length, return_buffer);
+ if (ret_buff == 0)
+ DBUG_RETURN(0);
+#ifndef THREAD
+ /* This is only true if we were able to read everything in one block */
+ if (return_buffer)
+ DBUG_RETURN(ret_buff);
+#endif
+ filepos+= r_length;
+ buff+= r_length;
+ offset= 0;
+ } while ((length-= r_length));
+
+ DBUG_RETURN(start);
+}
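
The decomposition of a read into key_cache_block_size increments, with the first chunk trimmed to the end of its cache block, can be reproduced in a standalone sketch. Here partition_no stands in for get_key_cache_partition and assumes the same simple base expression as in the sketch above:

  #include <stdio.h>

  #define SET_IF_SMALLER(a, b) do { if ((a) > (b)) (a)= (b); } while (0)

  static unsigned partition_no(int file, unsigned long long pos,
                               unsigned block_size, unsigned partitions)
  {
    return ((unsigned) (pos / block_size) + (unsigned) file) % partitions;
  }

  int main(void)
  {
    unsigned block_size= 1024, partitions= 4, length= 3000;
    int file= 7;
    unsigned long long filepos= 1500;          /* not block-aligned */
    unsigned offset= (unsigned) (filepos % block_size);

    /* Read data in block_size increments, as p_key_cache_read does. */
    do
    {
      unsigned r_length= length;
      SET_IF_SMALLER(r_length, block_size - offset);
      printf("read %4u bytes at %5llu from partition %u\n",
             r_length, filepos,
             partition_no(file, filepos, block_size, partitions));
      filepos+= r_length;
      offset= 0;
      length-= r_length;
    } while (length);
    return 0;
  }
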
+
+
+/*
+ Insert a block of file data from a buffer into a partitioned key cache
+
+ SYNOPSIS
+ p_key_cache_insert()
+ keycache_cb pointer to the control block of a partitioned key cache
+ file handler for the file to insert data from
+ filepos position of the block of data in the file to insert
+ level determines the weight of the data
+ buff buffer to read data from
+ length length of the data in the buffer
+
+ DESCRIPTION
+ This function is the implementation of the key_cache_insert interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned key
+ cache.
+ The function writes a block of file data from a buffer into the key cache.
+ The buffer is specified with the parameters buff and length - the pointer
+ to the beginning of the buffer and its size respectively. It's assumed
+ that the buffer contains the data read from 'file' starting at the position
+ filepos. The data is copied from the buffer in key_cache_block_size
+ increments. For every portion of data the function finds out in what simple
+ key cache from the array of partitions the data must be stored, and after
+ this calls s_key_cache_insert to copy the data into a key buffer of this
+ simple key cache.
+ The parameter level is used to set one characteristic for the key buffers
+ loaded with the data from buff. The characteristic is used only by the
+ midpoint insertion strategy.
+
+ RETURN VALUE
+ 0 if a success, 1 - otherwise.
+
+ NOTES
+ The function is used by MyISAM to move all blocks from an index file to
+ the key cache. It can be performed in parallel with reading the file data
+ from the key buffers by other threads.
+
+*/
+
+static
+int p_key_cache_insert(void *keycache_cb,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length)
+{
+ uint w_length;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint offset= (uint) (filepos % keycache->key_cache_block_size);
+ DBUG_ENTER("p_key_cache_insert");
+ DBUG_PRINT("enter", ("fd: %u pos: %lu length: %u",
+ (uint) file,(ulong) filepos, length));
+
+
+ /* Write data in key_cache_block_size increments */
+ do
+ {
+ S_KEY_CACHE_CB *partition= get_key_cache_partition(keycache,
+ file, filepos);
+ w_length= length;
+ set_if_smaller(w_length, keycache->key_cache_block_size);
+ if (s_key_cache_insert((void *) partition,
+ file, filepos, level,
+ buff, w_length))
+ DBUG_RETURN(1);
+
+ filepos+= w_length;
+ buff+= w_length;
+ offset = 0;
+ } while ((length-= w_length));
+
+ DBUG_RETURN(0);
+}
+
+
+/*
+ Write data from a buffer into a partitioned key cache
+
+ SYNOPSIS
+
+ p_key_cache_write()
+ keycache_cb pointer to the control block of a partitioned key cache
+ file handler for the file to write data to
+ filepos position in the file to write data to
+ level determines the weight of the data
+ buff buffer with the data
+ length length of the buffer
+ dont_write if is 0 then all dirty pages involved in writing
+ should have been flushed from key cache
+ file_extra maps of key cache partitions containing
+ dirty pages from file
+
+ DESCRIPTION
+ This function is the implementation of the key_cache_write interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned
+ key cache.
+ In a general case the function copies data from a buffer into the key
+ cache. The buffer is specified with the parameters buff and length -
+ the pointer to the beginning of the buffer and its size respectively.
+ It's assumed the buffer contains the data to be written into 'file'
+ starting from the position filepos. The data is copied from the buffer
+ in key_cache_block_size increments. For every portion of data the
+ function finds out in what simple key cache from the array of partitions
+ the data must be stored, and after this calls s_key_cache_write to copy
+ the data into a key buffer of this simple key cache.
+ If the value of the parameter dont_write is FALSE then the function
+ also writes the data into file.
+ The parameter level is used to set one characteristic for the key buffers
+ filled with the data from buff. The characteristic is employed only by
+ the midpoint insertion strategy.
+ The parameter file_extra provides a pointer to the shared bitmap of
+ the partitions that may contain dirty pages for the file. This bitmap
+ is used to optimize the function p_flush_key_blocks.
+
+ RETURN VALUE
+ 0 if a success, 1 - otherwise.
+
+ NOTES
+ This implementation exploits the fact that the function is called only
+ when a thread has got an exclusive lock for the key file.
+
+*/
+
+static
+int p_key_cache_write(void *keycache_cb,
+ File file, void *file_extra,
+ my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length __attribute__((unused)),
+ int dont_write)
+{
+ uint w_length;
+ ulonglong *part_map= (ulonglong *) file_extra;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint offset= (uint) (filepos % keycache->key_cache_block_size);
+ DBUG_ENTER("p_key_cache_write");
+ DBUG_PRINT("enter",
+ ("fd: %u pos: %lu length: %u block_length: %u"
+ " key_block_length: %u",
+ (uint) file, (ulong) filepos, length, block_length,
+ keycache ? keycache->key_cache_block_size : 0));
+
+
+ /* Write data in key_cache_block_size increments */
+ do
+ {
+ S_KEY_CACHE_CB *partition= get_key_cache_partition_for_write(keycache,
+ file, filepos,
+ part_map);
+ w_length = length;
+ set_if_smaller(w_length, keycache->key_cache_block_size );
+ if (s_key_cache_write(partition,
+ file, 0, filepos, level,
+ buff, w_length, block_length,
+ dont_write))
+ DBUG_RETURN(1);
+
+ filepos+= w_length;
+ buff+= w_length;
+ offset= 0;
+ } while ((length-= w_length));
+
+ DBUG_RETURN(0);
+}
+
+
+/*
+ Flush all blocks for a file from key buffers of a partitioned key cache
+
+ SYNOPSIS
+
+ p_flush_key_blocks()
+ keycache_cb pointer to the control block of a partitioned key cache
+ file handler for the file to flush to
+ file_extra maps of key cache partitions containing
+ dirty pages from file (not used)
+ flush_type type of the flush operation
+
+ DESCRIPTION
+ This function is the implementation of the flush_key_blocks interface
+ function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned
+ key cache.
+ In a general case the function flushes the data from all dirty key
+ buffers related to the file 'file' into this file. The function does
+ exactly this if the value of the parameter type is FLUSH_KEEP. If the
+ value of this parameter is FLUSH_RELEASE, the function additionally
+ releases the key buffers containing data from 'file' for new usage.
+ If the value of the parameter type is FLUSH_IGNORE_CHANGED the function
+ just releases the key buffers containing data from 'file'.
+ The function performs the operation by calling s_flush_key_blocks
+ for the elements of the array of the simple key caches that comprise
+ the partitioned key_cache. If the value of the parameter type is
+ FLUSH_KEEP s_flush_key_blocks is called only for the partitions with
+ possibly dirty pages marked in the bitmap pointed to by the parameter
+ file_extra.
+
+ RETURN
+ 0 ok
+ 1 error
+
+ NOTES
+ This implementation exploits the fact that the function is called only
+ when a thread has got an exclusive lock for the key file.
+
+*/
+
+static
+int p_flush_key_blocks(void *keycache_cb,
+ File file, void *file_extra,
+ enum flush_type type)
+{
+ uint i;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ int err= 0;
+ ulonglong *dirty_part_map= (ulonglong *) file_extra;
+ DBUG_ENTER("p_flush_key_blocks");
+ DBUG_PRINT("enter", ("keycache: 0x%lx", (long) keycache));
+
+ for (i= 0; i < partitions; i++)
+ {
+ S_KEY_CACHE_CB *partition= keycache->partition_array[i];
+ if ((type == FLUSH_KEEP || type == FLUSH_FORCE_WRITE) &&
+ !((*dirty_part_map) & (((ulonglong) 1) << i)))
+ continue;
+ err+= test(s_flush_key_blocks(partition, file, 0, type));
+ }
+ *dirty_part_map= 0;
+
+ if (err > 0)
+ err= 1;
+
+ DBUG_RETURN(err);
+}
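
The role of the dirty_part_map bitmap can be shown with a small standalone
sketch; flush_one_partition() is a made-up stand-in for s_flush_key_blocks()
and the types are simplified, but the skip-clean-partitions logic is the same.

#include <stdio.h>

typedef unsigned long long ulonglong;

#define PARTITIONS 64

/* hypothetical per-partition flush; returns 0 on success */
static int flush_one_partition(unsigned i, int file)
{
  printf("flushing partition %u for file %d\n", i, file);
  return 0;
}

/* FLUSH_KEEP-style flush: only partitions whose bit is set are visited */
static int flush_dirty_partitions(ulonglong *dirty_part_map, int file)
{
  int err= 0;
  unsigned i;
  for (i= 0; i < PARTITIONS; i++)
  {
    if (!((*dirty_part_map) & ((ulonglong) 1 << i)))
      continue;                        /* no dirty pages here, skip */
    err|= flush_one_partition(i, file);
  }
  *dirty_part_map= 0;                  /* the file is clean everywhere now */
  return err;
}

int main(void)
{
  /* writes for the file went through partitions 1 and 5 only */
  ulonglong map= ((ulonglong) 1 << 1) | ((ulonglong) 1 << 5);
  return flush_dirty_partitions(&map, 42);
}
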
+
+
+/*
+ Reset the counters of a partitioned key cache
+
+ SYNOPSIS
+ p_reset_key_cache_counters()
+ name the name of a key cache
+ keycache_cb pointer to the control block of a partitioned key cache
+
+ DESCRIPTION
+ This function is the implementation of the reset_key_cache_counters
+ interface function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned
+ key cache.
+ This function resets the values of the statistical counters of the simple
+ key caches comprising the partitioned key cache to 0. It does this by calling
+ s_reset_key_cache_counters for each key cache partition.
+ The parameter name is currently not used.
+
+ RETURN
+ 0 on success (always because it can't fail)
+
+*/
+
+static
+int p_reset_key_cache_counters(const char *name __attribute__((unused)),
+ void *keycache_cb)
+{
+ uint i;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ DBUG_ENTER("p_reset_key_cache_counters");
+
+ for (i = 0; i < partitions; i++)
+ {
+ s_reset_key_cache_counters(name, keycache->partition_array[i]);
+ }
+ DBUG_RETURN(0);
+}
+
+
+/*
+ Get statistics for a partitioned key cache
+
+ SYNOPSIS
+ p_get_key_cache_statistics()
+ keycache_cb pointer to the control block of a partitioned key cache
+ partition_no partition number to get statistics for
+ key_cache_stats OUT pointer to the structure for the returned statistics
+
+ DESCRIPTION
+ This function is the implementation of the get_key_cache_statistics
+ interface function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned
+ key cache.
+ If the value of the parameter partition_no is equal to 0 then aggregated
+ statistics for all partitions are returned in the fields of the
+ structure key_cache_stat of the type KEY_CACHE_STATISTICS. Otherwise
+ the function returns data for the partition number partition_no of the
+ key cache in the structure key_cache_stat. (Here partitions are numbered
+ starting from 1.)
+
+ RETURN
+ none
+
+*/
+
+static
+void p_get_key_cache_statistics(void *keycache_cb, uint partition_no,
+ KEY_CACHE_STATISTICS *key_cache_stats)
+{
+ uint i;
+ S_KEY_CACHE_CB *partition;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ DBUG_ENTER("p_get_key_cache_statistics_");
+
+ if (partition_no != 0)
+ {
+ partition= keycache->partition_array[partition_no-1];
+ s_get_key_cache_statistics((void *) partition, 0, key_cache_stats);
+ DBUG_VOID_RETURN;
+ }
+ key_cache_stats->mem_size= (longlong) keycache->key_cache_mem_size;
+ key_cache_stats->block_size= (longlong) keycache->key_cache_block_size;
+ for (i = 0; i < partitions; i++)
+ {
+ partition= keycache->partition_array[i];
+ key_cache_stats->blocks_used+= partition->blocks_used;
+ key_cache_stats->blocks_unused+= partition->blocks_unused;
+ key_cache_stats->blocks_changed+= partition->global_blocks_changed;
+ key_cache_stats->read_requests+= partition->global_cache_r_requests;
+ key_cache_stats->reads+= partition->global_cache_read;
+ key_cache_stats->write_requests+= partition->global_cache_w_requests;
+ key_cache_stats->writes+= partition->global_cache_write;
+ }
+ DBUG_VOID_RETURN;
+}
+
+/*
+ Get the value of a statistical variable for a partitioned key cache
+
+ SYNOPSIS
+ p_get_key_cache_stat_value()
+ keycache_cb pointer to the control block of a partitioned key cache
+ var_no the ordered number of a statistical variable
+
+ DESCRIPTION
+ This function is the implementation of the get_key_cache_stat_value
+ interface function that is employed by partitioned key caches.
+ The function considers the parameter keycache_cb as a pointer to the
+ control block structure of the type P_KEY_CACHE_CB for a partitioned
+ key cache.
+ This function returns the value of the statistical variable var_no
+ for this key cache. The variables are numbered from 0 to 6.
+ The returned value is calculated as the sum of the values of the
+ statistical variable with number var_no for all simple key caches that
+ comprise the partitioned key cache.
+
+ RETURN
+ The value of the specified statistical variable
+
+*/
+
+static
+ulonglong p_get_key_cache_stat_value(void *keycache_cb, uint var_no)
+{
+ uint i;
+ P_KEY_CACHE_CB *keycache= (P_KEY_CACHE_CB *) keycache_cb;
+ uint partitions= keycache->partitions;
+ size_t var_ofs= s_key_cache_stat_var_offsets[var_no];
+ ulonglong res= 0;
+ DBUG_ENTER("p_get_key_cache_stat_value");
+
+ if (var_no < 3)
+ {
+ for (i = 0; i < partitions; i++)
+ {
+ S_KEY_CACHE_CB *partition= keycache->partition_array[i];
+ res+= (ulonglong) (*(long *) ((char *) partition + var_ofs));
+ }
+ }
+ else
+ {
+ for (i = 0; i < partitions; i++)
+ {
+ S_KEY_CACHE_CB *partition= keycache->partition_array[i];
+ res+= *(ulonglong *) ((char *) partition + var_ofs);
+ }
+ }
+ DBUG_RETURN(res);
+}
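
A standalone sketch of the offset-table trick used here, with an invented
PARTITION struct standing in for S_KEY_CACHE_CB: the variable number selects a
byte offset (as in s_key_cache_stat_var_offsets), and the value is accumulated
as a long for variables 0-2 and as an ulonglong for the rest.

#include <stdio.h>
#include <stddef.h>

typedef unsigned long long ulonglong;

/* stand-in per-partition control block with a few counters */
typedef struct st_partition
{
  long      blocks_used;              /* "long" counters: var_no 0..2  */
  long      blocks_unused;
  long      blocks_changed;
  ulonglong cache_r_requests;         /* "ulonglong" counters: 3..     */
  ulonglong cache_read;
} PARTITION;

/* byte offsets indexed by variable number */
static const size_t stat_var_offsets[]=
{
  offsetof(PARTITION, blocks_used),
  offsetof(PARTITION, blocks_unused),
  offsetof(PARTITION, blocks_changed),
  offsetof(PARTITION, cache_r_requests),
  offsetof(PARTITION, cache_read),
};

static ulonglong stat_value(PARTITION *parts, unsigned n_parts, unsigned var_no)
{
  size_t ofs= stat_var_offsets[var_no];
  ulonglong res= 0;
  unsigned i;
  for (i= 0; i < n_parts; i++)
  {
    char *p= (char *) &parts[i];
    if (var_no < 3)                      /* counters stored as long */
      res+= (ulonglong) *(long *) (p + ofs);
    else                                 /* counters stored as ulonglong */
      res+= *(ulonglong *) (p + ofs);
  }
  return res;
}

int main(void)
{
  PARTITION parts[2]= { { 10, 5, 1, 100, 7 }, { 20, 2, 0, 300, 9 } };
  printf("blocks_used=%llu read_requests=%llu\n",
         stat_value(parts, 2, 0), stat_value(parts, 2, 3));
  return 0;
}
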
+
+
+/*
+ The array of pointers to the key cache interface functions used by
+ partitioned key caches. Any partitioned key cache object exploits
+ this array.
+
+ The current implementation of these functions does not allow them to be
+ called from the MySQL server code directly. The key cache interface
+ wrappers must be used for this purpose.
+*/
+
+static KEY_CACHE_FUNCS p_key_cache_funcs =
+{
+ p_init_key_cache,
+ p_resize_key_cache,
+ p_change_key_cache_param,
+ p_key_cache_read,
+ p_key_cache_insert,
+ p_key_cache_write,
+ p_flush_key_blocks,
+ p_reset_key_cache_counters,
+ p_end_key_cache,
+ p_get_key_cache_statistics,
+ p_get_key_cache_stat_value
+};
+
+
+/******************************************************************************
+ Key Cache Interface Module
+
+ The module contains wrappers for all key cache interface functions.
+
+ Currently there are key caches of two types: simple key caches and
+ partitioned key caches. Each type (class) has its own implementation of the
+ basic key cache operations used by the MyISAM storage engine. The pointers
+ to the implementation functions are stored in two static structures of the
+ type KEY_CACHE_FUNCS: s_key_cache_funcs - for simple key caches, and
+ p_key_cache_funcs - for partitioned key caches. When a key cache object is
+ created the constructor procedure init_key_cache places a pointer to the
+ corresponding table into one of its fields. The procedure also initializes
+ a control block for the key cache object and saves the pointer to this
+ block in another field of the key cache object.
+ When a key cache wrapper function is invoked for a key cache object to
+ perform a basic key cache operation it looks into the interface table
+ associated with the key cache object and calls the corresponding
+ implementation of the operation. It passes the saved key cache control
+ block to this implementation. If, for some reason, the control block
+ has not been fully initialized yet, the wrapper function either does
+ nothing or, in the case of a read/write operation, performs it directly
+ through the system i/o functions.
+
+ As we can see, the model with which the key cache interface is supported
+ is quite conventional for interfaces in general.
+
+******************************************************************************/
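
The dispatch model described above is an ordinary function-pointer table per
key cache class. A minimal standalone sketch with toy types follows; none of
the names below match the real KEY_CACHE layout.

#include <stdio.h>

/* the two "classes" share one interface table of function pointers */
typedef struct st_cache_funcs
{
  int (*read)(void *cb, int file, long pos);
  int (*write)(void *cb, int file, long pos);
} CACHE_FUNCS;

static int simple_read(void *cb, int file, long pos)
{ (void) cb; printf("simple read  %d:%ld\n", file, pos); return 0; }
static int simple_write(void *cb, int file, long pos)
{ (void) cb; printf("simple write %d:%ld\n", file, pos); return 0; }

static int part_read(void *cb, int file, long pos)
{ (void) cb; printf("partitioned read  %d:%ld\n", file, pos); return 0; }
static int part_write(void *cb, int file, long pos)
{ (void) cb; printf("partitioned write %d:%ld\n", file, pos); return 0; }

static CACHE_FUNCS simple_funcs= { simple_read, simple_write };
static CACHE_FUNCS part_funcs=   { part_read,   part_write  };

/* the wrapper object only keeps a control block and the table pointer */
typedef struct st_cache
{
  void *cb;
  CACHE_FUNCS *funcs;
} CACHE;

int main(void)
{
  CACHE c1= { 0, &simple_funcs };
  CACHE c2= { 0, &part_funcs };
  c1.funcs->read(c1.cb, 3, 100);      /* dispatches to the simple variant */
  c2.funcs->write(c2.cb, 3, 100);     /* dispatches to the partitioned one */
  return 0;
}

The wrapper only ever touches the table pointer and the opaque control block,
which is why a new key cache class can be added without changing any caller.
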
+
+
+/*
+ Initialize a key cache
+
+ SYNOPSIS
+ init_key_cache()
+ keycache pointer to the key cache to be initialized
+ key_cache_block_size size of blocks to keep cached data
+ use_mem total memory to use for cache buffers/structures
+ division_limit division limit (may be zero)
+ age_threshold age threshold (may be zero)
+ partitions number of partitions in the key cache
+
+ DESCRIPTION
+ The function creates a control block structure for a key cache and
+ places the pointer to this block in the structure keycache.
+ If the value of the parameter 'partitions' is 0 then a simple key cache
+ is created. Otherwise a partitioned key cache with the specified number
+ of partitions is created.
+ The parameter key_cache_block_size specifies the size of the blocks in
+ the key cache to be created. The parameters division_limit and
+ age_threshold determine the initial values of those characteristics of
+ the key cache that are used for midpoint insertion strategy. The parameter
+ use_mem specifies the total amount of memory to be allocated for the
+ key cache buffers and for all auxiliary structures.
+
+ RETURN VALUE
+ total number of blocks in key cache partitions, if successful,
+ <= 0 - otherwise.
+
+ NOTES
+ If keycache->key_cache_inited != 0 we assume that the memory
+ for the control block of the key cache has already been allocated.
+
+ It's assumed that no two threads call this function simultaneously
+ referring to the same key cache handle.
+
+*/
+
+int init_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
+ size_t use_mem, uint division_limit,
+ uint age_threshold, uint partitions)
+{
+ void *keycache_cb;
+ int blocks;
+ if (keycache->key_cache_inited)
+ keycache_cb= keycache->keycache_cb;
+ else
+ {
+ if (partitions == 0)
+ {
+ if (!(keycache_cb= (void *) my_malloc(sizeof(S_KEY_CACHE_CB), MYF(0))))
+ return 0;
+ ((S_KEY_CACHE_CB *) keycache_cb)->key_cache_inited= 0;
+ keycache->key_cache_type= SIMPLE_KEY_CACHE;
+ keycache->interface_funcs= &s_key_cache_funcs;
+ }
+ else
+ {
+ if (!(keycache_cb= (void *) my_malloc(sizeof(P_KEY_CACHE_CB), MYF(0))))
+ return 0;
+ ((P_KEY_CACHE_CB *) keycache_cb)->key_cache_inited= 0;
+ keycache->key_cache_type= PARTITIONED_KEY_CACHE;
+ keycache->interface_funcs= &p_key_cache_funcs;
+ }
+ keycache->keycache_cb= keycache_cb;
+ keycache->key_cache_inited= 1;
+ }
+
+ if (partitions != 0)
+ {
+ ((P_KEY_CACHE_CB *) keycache_cb)->partitions= partitions;
+ }
+ keycache->can_be_used= 0;
+ blocks= keycache->interface_funcs->init(keycache_cb, key_cache_block_size,
+ use_mem, division_limit,
+ age_threshold);
+ keycache->partitions= partitions ?
+ ((P_KEY_CACHE_CB *) keycache_cb)->partitions : 0;
+ DBUG_ASSERT(partitions <= MAX_KEY_CACHE_PARTITIONS);
+ if (blocks > 0)
+ keycache->can_be_used= 1;
+ return blocks;
+}
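
The shape of this constructor can be reduced to the following standalone
sketch; toy_init_key_cache(), its control block sizes and its return value are
invented, but the two points it illustrates come from the code above: the
control block is allocated only on the first call, and the partitions argument
selects the class.

#include <stdio.h>
#include <stdlib.h>

typedef struct st_key_cache
{
  int   key_cache_inited;
  int   partitioned;       /* stand-in for key_cache_type */
  void *keycache_cb;       /* control block, simple or partitioned */
} KEY_CACHE;

/* mirrors the shape of init_key_cache(): allocate the control block only
   on the first call and pick the "class" from the partitions argument */
static int toy_init_key_cache(KEY_CACHE *keycache, unsigned partitions)
{
  if (!keycache->key_cache_inited)
  {
    size_t cb_size= partitions ? 128 : 64;      /* stand-in sizes */
    if (!(keycache->keycache_cb= malloc(cb_size)))
      return 0;                                 /* 0 blocks: failure */
    keycache->partitioned= partitions != 0;
    keycache->key_cache_inited= 1;
  }
  printf("initialized %s key cache\n",
         keycache->partitioned ? "a partitioned" : "a simple");
  return 1024;                                  /* pretend block count */
}

int main(void)
{
  KEY_CACHE kc= { 0, 0, NULL };
  toy_init_key_cache(&kc, 0);   /* simple */
  toy_init_key_cache(&kc, 0);   /* second call reuses the control block */
  free(kc.keycache_cb);
  return 0;
}
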
+
+
+/*
+ Resize a key cache
+
+ SYNOPSIS
+ resize_key_cache()
+ keycache pointer to the key cache to be resized
+ key_cache_block_size size of blocks to keep cached data
+ use_mem total memory to use for the new key cache
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+
+ DESCRIPTION
+ The function operates over the key cache keycache.
+ The parameter key_cache_block_size specifies the new size of the block
+ buffers in the key cache. The parameters division_limit and age_threshold
+ determine the new initial values of those characteristics of the key cache
+ that are used for midpoint insertion strategy. The parameter use_mem
+ specifies the total amount of memory to be allocated for the key cache
+ buffers and for all auxiliary structures.
+
+ RETURN VALUE
+ number of blocks in the key cache, if successful,
+ 0 - otherwise.
+
+ NOTES
+ The function does not block the calls and executions of other functions
+ from the key cache interface. However it assumes that the calls of
+ resize_key_cache itself are serialized.
+
+ Currently the function is called when the values of the variables
+ key_buffer_size and/or key_cache_block_size are being reset for
+ the key cache keycache.
+
+*/
+
+int resize_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
+ size_t use_mem, uint division_limit, uint age_threshold)
+{
+ int blocks= -1;
+ if (keycache->key_cache_inited)
+ {
+ if ((uint) keycache->param_partitions != keycache->partitions && use_mem)
+ blocks= repartition_key_cache (keycache,
+ key_cache_block_size, use_mem,
+ division_limit, age_threshold,
+ (uint) keycache->param_partitions);
+ else
+ {
+ blocks= keycache->interface_funcs->resize(keycache->keycache_cb,
+ key_cache_block_size,
+ use_mem, division_limit,
+ age_threshold);
+
+ if (keycache->partitions)
+ keycache->partitions=
+ ((P_KEY_CACHE_CB *)(keycache->keycache_cb))->partitions;
+ }
+ if (blocks <= 0)
+ keycache->can_be_used= 0;
+ }
+ return blocks;
+}
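
The branch at the top of this function, resize in place unless the requested
partition count changed, can be shown with a tiny standalone sketch (a stand-in
KEY_CACHE holding only the two fields that matter here).

#include <stdio.h>
#include <stddef.h>

typedef struct st_key_cache
{
  unsigned partitions;        /* current number of partitions */
  unsigned param_partitions;  /* value just set by the user    */
} KEY_CACHE;

/* mirrors the decision in resize_key_cache(): a change of the partition
   number cannot be done in place, it requires a full repartition */
static const char *resize_action(KEY_CACHE *kc, size_t use_mem)
{
  if (kc->param_partitions != kc->partitions && use_mem)
    return "repartition_key_cache";
  return "interface_funcs->resize";
}

int main(void)
{
  KEY_CACHE kc= { 2, 8 };
  printf("%s\n", resize_action(&kc, 1 << 20));   /* repartition     */
  kc.param_partitions= 2;
  printf("%s\n", resize_action(&kc, 1 << 20));   /* in-place resize */
  return 0;
}
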
+
+
+/*
+ Change key cache parameters of a key cache
+
+ SYNOPSIS
+ change_key_cache_param()
+ keycache pointer to the key cache to change parameters for
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+
+ DESCRIPTION
+ The function sets new values of the division limit and the age threshold
+ used when the key cache keycache employs the midpoint insertion strategy.
+ The parameters division_limit and age_threshold provide these new values.
+
+ RETURN VALUE
+ none
+
+ NOTES
+ Currently the function is called when the values of the variables
+ key_cache_division_limit and/or key_cache_age_threshold are being reset
+ for the key cache keycache.
+
+*/
+
+void change_key_cache_param(KEY_CACHE *keycache, uint division_limit,
+ uint age_threshold)
+{
+ if (keycache->key_cache_inited)
+ {
+
+ keycache->interface_funcs->change_param(keycache->keycache_cb,
+ division_limit,
+ age_threshold);
+ }
+}
+
+
+/*
+ Destroy a key cache
+
+ SYNOPSIS
+ end_key_cache()
+ keycache pointer to the key cache to be destroyed
+ cleanup <=> complete free
+
+ DESCRIPTION
+ The function frees the memory allocated for the cache blocks and
+ auxiliary structures used by the key cache keycache. If the value
+ of the parameter cleanup is TRUE then all resources used by the key
+ cache are to be freed.
+
+ RETURN VALUE
+ none
+*/
+
+void end_key_cache(KEY_CACHE *keycache, my_bool cleanup)
+{
+ if (keycache->key_cache_inited)
+ {
+ keycache->interface_funcs->end(keycache->keycache_cb, cleanup);
+ if (cleanup)
+ {
+ if (keycache->keycache_cb)
+ {
+ my_free((uchar *) keycache->keycache_cb, MYF(0));
+ keycache->keycache_cb= 0;
+ }
+ keycache->key_cache_inited= 0;
+ }
+ keycache->can_be_used= 0;
+ }
+}
+
+
+/*
+ Read a block of data from a key cache into a buffer
+
+ SYNOPSIS
+
+ key_cache_read()
+ keycache pointer to the key cache to read data from
+ file handler for the file for the block of data to be read
+ filepos position of the block of data in the file
+ level determines the weight of the data
+ buff buffer to where the data must be placed
+ length length of the buffer
+ block_length length of the data read from a key cache block
+ return_buffer return pointer to the key cache buffer with the data
+
+ DESCRIPTION
+ The function operates over buffers of the key cache keycache.
+ In a general case the function reads a block of data from the key cache
+ into the buffer buff of the size specified by the parameter length. The
+ beginning of the block of data to be read is specified by the parameters
+ file and filepos. The length of the read data is the same as the length
+ of the buffer.
+ If the parameter return_buffer is not ignored and its value is TRUE, and
+ the data to be read of the specified size block_length can be read from one
+ key cache buffer, then the function returns a pointer to the data in the
+ key cache buffer.
+ The parameter 'level' is used only by the midpoint insertion strategy
+ when the data or its portion cannot be found in the key cache.
+ The function reads data into the buffer directly from file if the control
+ block of the key cache has not been initialized yet.
+
+ RETURN VALUE
+ Returns address from where the data is placed if successful, 0 - otherwise.
+
+ NOTES
+ Filepos must be a multiple of 'block_length', but it doesn't
+ have to be a multiple of key_cache_block_size.
+*/
+
+uchar *key_cache_read(KEY_CACHE *keycache,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length, int return_buffer)
+{
+ if (keycache->key_cache_inited && keycache->can_be_used)
+ return keycache->interface_funcs->read(keycache->keycache_cb,
+ file, filepos, level,
+ buff, length,
+ block_length, return_buffer);
+
+ /* We can't use mutex here as the key cache may not be initialized */
+ keycache->global_cache_r_requests++;
+ keycache->global_cache_read++;
+
+ if (my_pread(file, (uchar*) buff, length, filepos, MYF(MY_NABP)))
+ return (uchar *) 0;
+
+ return buff;
+}
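
The fallback at the end of key_cache_read() is just a direct positional read.
Below is a standalone sketch of the same pattern using plain POSIX
pread()/pwrite() on a scratch file instead of my_pread(); read_block(),
cache_ready and demo.dat are invented for the example.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* stand-in for keycache->key_cache_inited && keycache->can_be_used */
static int cache_ready= 0;

/* mirrors the wrapper's fallback: when the cache is not usable,
   bypass it and read straight from the file at the given position */
static ssize_t read_block(int fd, void *buff, size_t length, off_t filepos)
{
  if (cache_ready)
  {
    /* the real wrapper would call keycache->interface_funcs->read() here */
  }
  return pread(fd, buff, length, filepos);
}

int main(void)
{
  char buff[32]= { 0 };
  int fd= open("demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0)
    return 1;
  if (pwrite(fd, "hello keycache", 14, 0) != 14)
    return 1;
  if (read_block(fd, buff, 14, 0) != 14)
    return 1;
  printf("read back: %s\n", buff);
  close(fd);
  return 0;
}
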
+
+
+/*
+ Insert a block of file data from a buffer into a key cache
+
+ SYNOPSIS
+ key_cache_insert()
+ keycache pointer to the key cache to insert data into
+ file handler for the file to insert data from
+ filepos position of the block of data in the file to insert
+ level determines the weight of the data
+ buff buffer to read data from
+ length length of the data in the buffer
+
+ DESCRIPTION
+ The function operates over buffers of the key cache keycache.
+ The function writes a block of file data from a buffer into the key cache.
+ The buffer is specified with the parameters buff and length - the pointer
+ to the beginning of the buffer and its size respectively. It's assumed
+ that the buffer contains the data from 'file' starting from the position
+ filepos.
+ The parameter level is used to set one characteristic for the key buffers
+ loaded with the data from buff. The characteristic is used only by the
+ midpoint insertion strategy.
+
+ RETURN VALUE
+ 0 if successful, 1 - otherwise.
+
+ NOTES
+ The function is used by MyISAM to move all blocks from an index file to
+ the key cache.
+ It is assumed that it may be performed in parallel with reading the file
+ data from the key buffers by other threads.
+
+*/
+
+int key_cache_insert(KEY_CACHE *keycache,
+ File file, my_off_t filepos, int level,
+ uchar *buff, uint length)
+{
+ if (keycache->key_cache_inited && keycache->can_be_used)
+ return keycache->interface_funcs->insert(keycache->keycache_cb,
+ file, filepos, level,
+ buff, length);
+ return 0;
+}
+
+
+/*
+ Write data from a buffer into a key cache
+
+ SYNOPSIS
+
+ key_cache_write()
+ keycache pointer to the key cache to write data to
+ file handler for the file to write data to
+ filepos position in the file to write data to
+ level determines the weight of the data
+ buff buffer with the data
+ length length of the buffer
+ dont_write if it is 0 then all dirty pages involved in writing
+ should have been flushed from the key cache
+ file_extra pointer to optional file attributes
+
+ DESCRIPTION
+ The function operates over buffers of the key cache keycache.
+ In a general case the function writes data from a buffer into the key
+ cache. The buffer is specified with the parameters buff and length -
+ the pointer to the beginning of the buffer and its size respectively.
+ It's assumed the buffer contains the data to be written into 'file'
+ starting from the position filepos.
+ If the value of the parameter dont_write is FALSE then the function
+ also writes the data into file.
+ The parameter level is used to set one characteristic for the key buffers
+ filled with the data from buff. The characteristic is employed only by
+ the midpoint insertion strategy.
+ The parameter file_extra may point to additional file attributes used
+ for optimization or other purposes.
+ The function writes data from the buffer directly into file if the control
+ block of the key cache has not been initialized yet.
+
+ RETURN VALUE
+ 0 if successful, 1 - otherwise.
+
+ NOTES
+ This implementation may exploit the fact that the function is called only
+ when a thread has got an exclusive lock for the key file.
+
+*/
+
+int key_cache_write(KEY_CACHE *keycache,
+ File file, void *file_extra,
+ my_off_t filepos, int level,
+ uchar *buff, uint length,
+ uint block_length, int force_write)
+{
+ if (keycache->key_cache_inited && keycache->can_be_used)
+ return keycache->interface_funcs->write(keycache->keycache_cb,
+ file, file_extra,
+ filepos, level,
+ buff, length,
+ block_length, force_write);
+
+ /* We can't use mutex here as the key cache may not be initialized */
+ keycache->global_cache_w_requests++;
+ keycache->global_cache_write++;
+ if (my_pwrite(file, buff, length, filepos, MYF(MY_NABP | MY_WAIT_IF_FULL)))
+ return 1;
+
+ return 0;
+}
+
+
+/*
+ Flush all blocks for a file from key buffers of a key cache
+
+ SYNOPSIS
+
+ flush_key_blocks()
+ keycache pointer to the key cache whose blocks are to be flushed
+ file handler for the file to flush to
+ file_extra maps of key cache (used for partitioned key caches)
+ flush_type type of the flush operation
+
+ DESCRIPTION
+ The function operates over buffers of the key cache keycache.
+ In a general case the function flushes the data from all dirty key
+ buffers related to the file 'file' into this file. The function does
+ exactly this if the value of the parameter type is FLUSH_KEEP. If the
+ value of this parameter is FLUSH_RELEASE, the function additionally
+ releases the key buffers containing data from 'file' for new usage.
+ If the value of the parameter type is FLUSH_IGNORE_CHANGED the function
+ just releases the key buffers containing data from 'file'.
+ If the value of the parameter type is FLUSH_KEEP the function may use
+ the value of the parameter file_extra pointing to possibly dirty
+ partitions to optimize the operation for partitioned key caches.
+
+ RETURN
+ 0 ok
+ 1 error
+
+ NOTES
+ Any implementation of the function may exploit the fact that the function
+ is called only when a thread has got an exclusive lock for the key file.
+
+*/
+
+int flush_key_blocks(KEY_CACHE *keycache,
+ int file, void *file_extra,
+ enum flush_type type)
+{
+ if (keycache->key_cache_inited)
+ return keycache->interface_funcs->flush(keycache->keycache_cb,
+ file, file_extra, type);
+ return 0;
+}
+
+
+/*
+ Reset the counters of a key cache
+
+ SYNOPSIS
+ reset_key_cache_counters()
+ name the name of a key cache (unused)
+ keycache pointer to the key cache for which to reset counters
+
+ DESCRIPTION
+ This function resets the values of the statistical counters for the key
+ cache keycache.
+ The parameter name is currently not used.
+
+ RETURN
+ 0 on success (always because it can't fail)
+
+ NOTES
+ This procedure is used by process_key_caches() to reset the counters of all
+ currently used key caches, both the default one and the named ones.
+
+*/
+
+int reset_key_cache_counters(const char *name __attribute__((unused)),
+ KEY_CACHE *keycache)
+{
+ if (keycache->key_cache_inited)
+ {
+
+ return keycache->interface_funcs->reset_counters(name,
+ keycache->keycache_cb);
+ }
+ return 0;
+}
+
+
+/*
+ Get statistics for a key cache
+
+ SYNOPSIS
+ get_key_cache_statistics()
+ keycache pointer to the key cache to get statistics for
+ partition_no partition number to get statistics for
+ key_cache_stats OUT pointer to the structure for the returned statistics
+
+ DESCRIPTION
+ If the value of the parameter partition_no is equal to 0 then statistics
+ for the whole key cache keycache (aggregated statistics) are returned in the
+ fields of the structure key_cache_stat of the type KEY_CACHE_STATISTICS.
+ Otherwise the value of the parameter partition_no makes sense only for
+ a partitioned key cache. In this case the function returns statistics
+ for the partition with the specified number partition_no.
+
+ RETURN
+ none
+
+*/
+
+void get_key_cache_statistics(KEY_CACHE *keycache, uint partition_no,
+ KEY_CACHE_STATISTICS *key_cache_stats)
+{
+ bzero(key_cache_stats, sizeof(KEY_CACHE_STATISTICS));
+ if (keycache->key_cache_inited)
+ {
+ keycache->interface_funcs->get_stats(keycache->keycache_cb,
+ partition_no, key_cache_stats);
+ }
+}
+
+
+/*
+ Get the value of a statistical variable for a key cache
+
+ SYNOPSIS
+ get_key_cache_stat_value()
+ keycache pointer to the key cache to get statistics for
+ var_no the ordered number of a statistical variable
+
+ DESCRIPTION
+ This function returns the value of the statistical variable var_no for
+ the key cache keycache. The variables are numbered from 0 to 6.
+
+ RETURN
+ The value of the specified statistical variable.
+
+ NOTES
+ Currently for any key cache the function can return values for the
+ following 7 statistical variables:
+
+ Name Number
+
+ blocks_used 0
+ blocks_unused 1
+ blocks_changed 2
+ read_requests 3
+ reads 4
+ write_requests 5
+ writes 6
+
+*/
+
+ulonglong get_key_cache_stat_value(KEY_CACHE *keycache, uint var_no)
+{
+ if (keycache->key_cache_inited)
+ {
+ return keycache->interface_funcs->get_stat_val(keycache->keycache_cb,
+ var_no);
+ }
+ else
+ return 0;
+}
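
As a usage illustration of the numbering above, here is a standalone sketch
that derives a miss rate from variables 3 (read_requests) and 4 (reads);
toy_stat_value() and its values are made-up stand-ins for
get_key_cache_stat_value().

#include <stdio.h>

typedef unsigned long long ulonglong;

/* stand-in for get_key_cache_stat_value(); the variable numbering
   follows the table above: 3 = read_requests, 4 = reads */
static ulonglong toy_stat_value(unsigned var_no)
{
  static const ulonglong vals[7]= { 800, 200, 10, 100000, 2500, 40000, 3000 };
  return var_no < 7 ? vals[var_no] : 0;
}

int main(void)
{
  ulonglong read_requests= toy_stat_value(3);
  ulonglong reads= toy_stat_value(4);
  if (read_requests)
    printf("key cache miss rate: %.2f%%\n",
           100.0 * (double) reads / (double) read_requests);
  return 0;
}
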
+
+
+/*
+ Repartition a key cache
+
+ SYNOPSIS
+ repartition_key_cache()
+ keycache pointer to the key cache to be repartitioned
+ key_cache_block_size size of blocks to keep cached data
+ use_mem total memory to use for the new key cache
+ division_limit new division limit (if not zero)
+ age_threshold new age threshold (if not zero)
+ partitions new number of partitions in the key cache
+
+ DESCRIPTION
+ The function operates over the key cache keycache.
+ The parameter partitions specifies the number of partitions in the key
+ cache after repartitioning. If the value of this parameter is 0 then
+ a simple key cache must be created instead of the old one.
+ The parameter key_cache_block_size specifies the new size of the block
+ buffers in the key cache. The parameters division_limit and age_threshold
+ determine the new initial values of those characteristics of the key cache
+ that are used for midpoint insertion strategy. The parameter use_mem
+ specifies the total amount of memory to be allocated for the new key
+ cache buffers and for all auxiliary structures.
+
+ RETURN VALUE
+ number of blocks in the key cache, if successful,
+ 0 - otherwise.
+
+ NOTES
+ The function does not block the calls and executions of other functions
+ from the key cache interface. However it assumes that the calls of
+ repartition_key_cache itself are serialized.
+
+ Currently the function is called when the value of the variable
+ key_cache_partitions is being reset for the key cache keycache.
+
+*/
+
+int repartition_key_cache(KEY_CACHE *keycache, uint key_cache_block_size,
+ size_t use_mem, uint division_limit,
+ uint age_threshold, uint partitions)
+{
+ int blocks= -1;
+ if (keycache->key_cache_inited)
+ {
+ keycache->interface_funcs->resize(keycache->keycache_cb,
+ key_cache_block_size, 0,
+ division_limit, age_threshold);
+ end_key_cache(keycache, 1);
+ blocks= init_key_cache(keycache, key_cache_block_size, use_mem,
+ division_limit, age_threshold, partitions);
+ }
+ return blocks;
+}
+
=== modified file 'sql/handler.cc'
--- a/sql/handler.cc 2009-09-09 21:06:57 +0000
+++ b/sql/handler.cc 2010-06-29 00:10:53 +0000
@@ -3691,11 +3691,13 @@ int ha_init_key_cache(const char *name,
uint tmp_block_size= (uint) key_cache->param_block_size;
uint division_limit= key_cache->param_division_limit;
uint age_threshold= key_cache->param_age_threshold;
+ uint partitions= key_cache->param_partitions;
pthread_mutex_unlock(&LOCK_global_system_variables);
DBUG_RETURN(!init_key_cache(key_cache,
tmp_block_size,
tmp_buff_size,
- division_limit, age_threshold));
+ division_limit, age_threshold,
+ partitions));
}
DBUG_RETURN(0);
}
@@ -3725,10 +3727,12 @@ int ha_resize_key_cache(KEY_CACHE *key_c
/**
- Change parameters for key cache (like size)
+ Change parameters for key cache (like division_limit)
*/
int ha_change_key_cache_param(KEY_CACHE *key_cache)
{
+ DBUG_ENTER("ha_change_key_cache_param");
+
if (key_cache->key_cache_inited)
{
pthread_mutex_lock(&LOCK_global_system_variables);
@@ -3737,9 +3741,35 @@ int ha_change_key_cache_param(KEY_CACHE
pthread_mutex_unlock(&LOCK_global_system_variables);
change_key_cache_param(key_cache, division_limit, age_threshold);
}
- return 0;
+ DBUG_RETURN(0);
}
+
+/**
+ Repartition key cache
+*/
+int ha_repartition_key_cache(KEY_CACHE *key_cache)
+{
+ DBUG_ENTER("ha_repartition_key_cache");
+
+ if (key_cache->key_cache_inited)
+ {
+ pthread_mutex_lock(&LOCK_global_system_variables);
+ size_t tmp_buff_size= (size_t) key_cache->param_buff_size;
+ long tmp_block_size= (long) key_cache->param_block_size;
+ uint division_limit= key_cache->param_division_limit;
+ uint age_threshold= key_cache->param_age_threshold;
+ uint partitions= key_cache->param_partitions;
+ pthread_mutex_unlock(&LOCK_global_system_variables);
+ DBUG_RETURN(!repartition_key_cache(key_cache, tmp_block_size,
+ tmp_buff_size,
+ division_limit, age_threshold,
+ partitions));
+ }
+ DBUG_RETURN(0);
+}
+
+
/**
Free memory allocated by a key cache.
*/
=== modified file 'sql/handler.h'
--- a/sql/handler.h 2009-09-07 20:50:10 +0000
+++ b/sql/handler.h 2010-06-29 00:10:53 +0000
@@ -2026,6 +2026,7 @@ int ha_table_exists_in_engine(THD* thd,
extern "C" int ha_init_key_cache(const char *name, KEY_CACHE *key_cache);
int ha_resize_key_cache(KEY_CACHE *key_cache);
int ha_change_key_cache_param(KEY_CACHE *key_cache);
+int ha_repartition_key_cache(KEY_CACHE *key_cache);
int ha_change_key_cache(KEY_CACHE *old_key_cache, KEY_CACHE *new_key_cache);
int ha_end_key_cache(KEY_CACHE *key_cache);
=== modified file 'sql/mysqld.cc'
--- a/sql/mysqld.cc 2009-10-07 13:07:10 +0000
+++ b/sql/mysqld.cc 2010-06-29 00:10:53 +0000
@@ -5713,6 +5713,7 @@ enum options_mysqld
OPT_INTERACTIVE_TIMEOUT, OPT_JOIN_BUFF_SIZE,
OPT_KEY_BUFFER_SIZE, OPT_KEY_CACHE_BLOCK_SIZE,
OPT_KEY_CACHE_DIVISION_LIMIT, OPT_KEY_CACHE_AGE_THRESHOLD,
+ OPT_KEY_CACHE_PARTITIONS,
OPT_LONG_QUERY_TIME,
OPT_LOWER_CASE_TABLE_NAMES, OPT_MAX_ALLOWED_PACKET,
OPT_MAX_BINLOG_CACHE_SIZE, OPT_MAX_BINLOG_SIZE,
@@ -6789,6 +6790,12 @@ log and this option does nothing anymore
(uchar**) 0,
0, (GET_ULONG | GET_ASK_ADDR) , REQUIRED_ARG, 100,
1, 100, 0, 1, 0},
+ {"key_cache_partitions", OPT_KEY_CACHE_PARTITIONS,
+ "The number of partitions in key cache",
+ (uchar**) &dflt_key_cache_var.param_partitions,
+ (uchar**) 0,
+ 0, (GET_ULONG | GET_ASK_ADDR), REQUIRED_ARG, DEFAULT_KEY_CACHE_PARTITIONS,
+ 0, MAX_KEY_CACHE_PARTITIONS, 0, 1, 0},
{"log-slow-filter", OPT_LOG_SLOW_FILTER,
"Log only the queries that followed certain execution plan. Multiple flags allowed in a comma-separated string. [admin, filesort, filesort_on_disk, full_join, full_scan, query_cache, query_cache_miss, tmp_table, tmp_table_on_disk]. Sets log-slow-admin-command to ON",
0, 0, 0, GET_STR, REQUIRED_ARG, 0, 0, 0, QPLAN_ALWAYS_SET, 0, 0},
@@ -8664,6 +8671,7 @@ mysql_getopt_value(const char *keyname,
case OPT_KEY_CACHE_BLOCK_SIZE:
case OPT_KEY_CACHE_DIVISION_LIMIT:
case OPT_KEY_CACHE_AGE_THRESHOLD:
+ case OPT_KEY_CACHE_PARTITIONS:
{
KEY_CACHE *key_cache;
if (!(key_cache= get_or_create_key_cache(keyname, key_length)))
@@ -8681,6 +8689,8 @@ mysql_getopt_value(const char *keyname,
return (uchar**) &key_cache->param_division_limit;
case OPT_KEY_CACHE_AGE_THRESHOLD:
return (uchar**) &key_cache->param_age_threshold;
+ case OPT_KEY_CACHE_PARTITIONS:
+ return (uchar**) &key_cache->param_partitions;
}
}
}
=== modified file 'sql/set_var.cc'
--- a/sql/set_var.cc 2009-09-15 10:46:35 +0000
+++ b/sql/set_var.cc 2010-06-29 00:10:53 +0000
@@ -314,15 +314,18 @@ static sys_var_thd_ulong sys_interactive
static sys_var_thd_ulong sys_join_buffer_size(&vars, "join_buffer_size",
&SV::join_buff_size);
static sys_var_key_buffer_size sys_key_buffer_size(&vars, "key_buffer_size");
-static sys_var_key_cache_long sys_key_cache_block_size(&vars, "key_cache_block_size",
- offsetof(KEY_CACHE,
- param_block_size));
-static sys_var_key_cache_long sys_key_cache_division_limit(&vars, "key_cache_division_limit",
- offsetof(KEY_CACHE,
- param_division_limit));
-static sys_var_key_cache_long sys_key_cache_age_threshold(&vars, "key_cache_age_threshold",
- offsetof(KEY_CACHE,
- param_age_threshold));
+static sys_var_key_cache_long sys_key_cache_block_size(&vars,
+ "key_cache_block_size",
+ offsetof(KEY_CACHE,param_block_size));
+static sys_var_key_cache_long sys_key_cache_division_limit(&vars,
+ "key_cache_division_limit",
+ offsetof(KEY_CACHE, param_division_limit));
+static sys_var_key_cache_long sys_key_cache_age_threshold(&vars,
+ "key_cache_age_threshold",
+ offsetof(KEY_CACHE, param_age_threshold));
+static sys_var_key_cache_long sys_key_cache_partitions(&vars,
+ "key_cache_partitions",
+ offsetof(KEY_CACHE, param_partitions));
static sys_var_const sys_language(&vars, "language",
OPT_GLOBAL, SHOW_CHAR,
(uchar*) language);
@@ -2528,7 +2531,21 @@ bool sys_var_key_cache_long::update(THD
pthread_mutex_unlock(&LOCK_global_system_variables);
- error= (bool) (ha_resize_key_cache(key_cache));
+ switch (offset) {
+
+ case offsetof(KEY_CACHE, param_block_size):
+ error= (bool) (ha_resize_key_cache(key_cache));
+ break;
+
+ case offsetof(KEY_CACHE, param_division_limit):
+ case offsetof(KEY_CACHE, param_age_threshold):
+ error= (bool) (ha_change_key_cache_param(key_cache));
+ break;
+
+ case offsetof(KEY_CACHE, param_partitions):
+ error= (bool) (ha_repartition_key_cache(key_cache));
+ break;
+ }
pthread_mutex_lock(&LOCK_global_system_variables);
key_cache->in_init= 0;
@@ -4131,6 +4148,7 @@ static KEY_CACHE *create_key_cache(const
key_cache->param_block_size= dflt_key_cache_var.param_block_size;
key_cache->param_division_limit= dflt_key_cache_var.param_division_limit;
key_cache->param_age_threshold= dflt_key_cache_var.param_age_threshold;
+ key_cache->param_partitions= dflt_key_cache_var.param_partitions;
}
}
DBUG_RETURN(key_cache);
=== modified file 'sql/set_var.h'
--- a/sql/set_var.h 2009-09-15 10:46:35 +0000
+++ b/sql/set_var.h 2010-06-29 00:10:53 +0000
@@ -1411,6 +1411,7 @@ public:
my_free((uchar*) name, MYF(0));
}
friend bool process_key_caches(process_key_cache_t func);
+ friend int fill_key_cache_tables(THD *thd, TABLE_LIST *tables, COND *cond);
friend void delete_elements(I_List<NAMED_LIST> *list,
void (*free_element)(const char*, uchar*));
};
=== modified file 'sql/sql_show.cc'
--- a/sql/sql_show.cc 2009-09-23 11:03:47 +0000
+++ b/sql/sql_show.cc 2010-06-29 00:10:53 +0000
@@ -2106,6 +2106,31 @@ inline void make_upper(char *buf)
*buf= my_toupper(system_charset_info, *buf);
}
+
+static void update_key_cache_stat_var(KEY_CACHE *key_cache, size_t ofs)
+{
+ uint var_no;
+ switch (ofs) {
+ case offsetof(KEY_CACHE, blocks_used):
+ case offsetof(KEY_CACHE, blocks_unused):
+ case offsetof(KEY_CACHE, global_blocks_changed):
+ var_no= (ofs-offsetof(KEY_CACHE, blocks_used))/sizeof(ulong);
+ *(ulong *)((char *) key_cache + ofs)=
+ (ulong) get_key_cache_stat_value(key_cache, var_no);
+ break;
+ case offsetof(KEY_CACHE, global_cache_r_requests):
+ case offsetof(KEY_CACHE, global_cache_read):
+ case offsetof(KEY_CACHE, global_cache_w_requests):
+ case offsetof(KEY_CACHE, global_cache_write):
+ var_no= 3+(ofs-offsetof(KEY_CACHE, global_cache_w_requests))/
+ sizeof(ulonglong);
+ *(ulonglong *)((char *) key_cache + ofs)=
+ get_key_cache_stat_value(key_cache, var_no);
+ break;
+ }
+}
+
+
static bool show_status_array(THD *thd, const char *wild,
SHOW_VAR *variables,
enum enum_var_type value_type,
@@ -2238,10 +2263,12 @@ static bool show_status_array(THD *thd,
break;
}
case SHOW_KEY_CACHE_LONG:
+ update_key_cache_stat_var(dflt_key_cache, (size_t) value);
value= (char*) dflt_key_cache + (ulong)value;
end= int10_to_str(*(long*) value, buff, 10);
break;
case SHOW_KEY_CACHE_LONGLONG:
+ update_key_cache_stat_var(dflt_key_cache, (size_t) value);
value= (char*) dflt_key_cache + (ulong)value;
end= longlong10_to_str(*(longlong*) value, buff, 10);
break;
@@ -6095,6 +6122,90 @@ int fill_schema_files(THD *thd, TABLE_LI
}
+static
+int store_key_cache_table_record(THD *thd, TABLE *table,
+ const char *name, uint name_length,
+ KEY_CACHE *key_cache,
+ uint partitions, uint partition_no)
+{
+ KEY_CACHE_STATISTICS key_cache_stats;
+ uint err;
+ DBUG_ENTER("store_key_cache_table_record");
+
+ get_key_cache_statistics(key_cache, partition_no, &key_cache_stats);
+
+ if (key_cache_stats.mem_size == 0)
+ DBUG_RETURN(0);
+
+ restore_record(table, s->default_values);
+ table->field[0]->store(name, name_length, system_charset_info);
+ if (partitions == 0)
+ table->field[1]->set_null();
+ else
+ {
+ table->field[1]->set_notnull();
+ table->field[1]->store((long) partitions, TRUE);
+ }
+
+ if (partition_no == 0)
+ table->field[2]->set_null();
+ else
+ {
+ table->field[2]->set_notnull();
+ table->field[2]->store((long) partition_no, TRUE);
+ }
+ table->field[3]->store(key_cache_stats.mem_size, TRUE);
+ table->field[4]->store(key_cache_stats.block_size, TRUE);
+ table->field[5]->store(key_cache_stats.blocks_used, TRUE);
+ table->field[6]->store(key_cache_stats.blocks_unused, TRUE);
+ table->field[7]->store(key_cache_stats.blocks_changed, TRUE);
+ table->field[8]->store(key_cache_stats.read_requests, TRUE);
+ table->field[9]->store(key_cache_stats.reads, TRUE);
+ table->field[10]->store(key_cache_stats.write_requests, TRUE);
+ table->field[11]->store(key_cache_stats.writes, TRUE);
+
+ err= schema_table_store_record(thd, table);
+ DBUG_RETURN(err);
+}
+
+
+int fill_key_cache_tables(THD *thd, TABLE_LIST *tables, COND *cond)
+{
+ TABLE *table= tables->table;
+ I_List_iterator<NAMED_LIST> it(key_caches);
+ NAMED_LIST *element;
+ DBUG_ENTER("fill_key_cache_tables");
+
+ while ((element= it++))
+ {
+ KEY_CACHE *key_cache= (KEY_CACHE *) element->data;
+
+ if (!key_cache->key_cache_inited)
+ continue;
+
+ uint partitions= key_cache->partitions;
+ DBUG_ASSERT(partitions <= MAX_KEY_CACHE_PARTITIONS);
+
+ if (partitions)
+ {
+ for (uint i= 0; i < partitions; i++)
+ {
+ if (store_key_cache_table_record(thd, table,
+ element->name, element->name_length,
+ key_cache, partitions, i+1))
+ DBUG_RETURN(1);
+ }
+ }
+
+ if (store_key_cache_table_record(thd, table,
+ element->name, element->name_length,
+ key_cache, partitions, 0))
+ DBUG_RETURN(1);
+ }
+ DBUG_RETURN(0);
+}
+
+
ST_FIELD_INFO schema_fields_info[]=
{
{"CATALOG_NAME", FN_REFLEN, MYSQL_TYPE_STRING, 0, 1, 0, SKIP_OPEN_TABLE},
@@ -6672,6 +6783,35 @@ ST_FIELD_INFO referential_constraints_fi
};
+ST_FIELD_INFO keycache_fields_info[]=
+{
+ {"KEY_CACHE_NAME", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0, SKIP_OPEN_TABLE},
+ {"PARTITIONS", 3, MYSQL_TYPE_LONG, 0,
+ (MY_I_S_MAYBE_NULL | MY_I_S_UNSIGNED) , 0, SKIP_OPEN_TABLE},
+ {"PARTITION_NUMBER", 3, MYSQL_TYPE_LONG, 0,
+ (MY_I_S_MAYBE_NULL | MY_I_S_UNSIGNED), 0, SKIP_OPEN_TABLE},
+ {"FULL_SIZE", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), 0, SKIP_OPEN_TABLE},
+ {"BLOCK_SIZE", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), 0, SKIP_OPEN_TABLE },
+ {"USED_BLOCKS", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_blocks_used", SKIP_OPEN_TABLE},
+ {"UNUSED_BLOCKS", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_blocks_unused", SKIP_OPEN_TABLE},
+ {"DIRTY_BLOCKS", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_blocks_not_flushed", SKIP_OPEN_TABLE},
+ {"READ_REQUESTS", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_read_requests", SKIP_OPEN_TABLE},
+ {"READS", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_reads", SKIP_OPEN_TABLE},
+ {"WRITE_REQUESTS", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_write_requests", SKIP_OPEN_TABLE},
+ {"WRITES", MY_INT64_NUM_DECIMAL_DIGITS, MYSQL_TYPE_LONGLONG, 0,
+ (MY_I_S_UNSIGNED), "Key_writes", SKIP_OPEN_TABLE},
+ {0, 0, MYSQL_TYPE_STRING, 0, 0, 0, SKIP_OPEN_TABLE}
+};
+
+
/*
Description of ST_FIELD_INFO in table.h
@@ -6707,6 +6847,8 @@ ST_SCHEMA_TABLE schema_tables[]=
fill_status, make_old_format, 0, 0, -1, 0, 0},
{"GLOBAL_VARIABLES", variables_fields_info, create_schema_table,
fill_variables, make_old_format, 0, 0, -1, 0, 0},
+ {"KEY_CACHES", keycache_fields_info, create_schema_table,
+ fill_key_cache_tables, make_old_format, 0, -1,-1, 0, 0},
{"KEY_COLUMN_USAGE", key_column_usage_fields_info, create_schema_table,
get_all_tables, 0, get_schema_key_column_usage_record, 4, 5, 0,
OPEN_TABLE_ONLY},
=== modified file 'sql/sql_test.cc'
--- a/sql/sql_test.cc 2009-09-07 20:50:10 +0000
+++ b/sql/sql_test.cc 2010-06-29 00:10:53 +0000
@@ -435,7 +435,8 @@ static int print_key_cache_status(const
Buffer_size: %10lu\n\
Block_size: %10lu\n\
Division_limit: %10lu\n\
-Age_limit: %10lu\n\
+Age_threshold: %10lu\n\
+Partitions: %10lu\n\
blocks used: %10lu\n\
not flushed: %10lu\n\
w_requests: %10s\n\
@@ -445,6 +446,7 @@ reads: %10s\n\n",
name,
(ulong) key_cache->param_buff_size, key_cache->param_block_size,
key_cache->param_division_limit, key_cache->param_age_threshold,
+ key_cache->param_partitions,
key_cache->blocks_used,key_cache->global_blocks_changed,
llstr(key_cache->global_cache_w_requests,llbuff1),
llstr(key_cache->global_cache_write,llbuff2),
=== modified file 'sql/table.h'
--- a/sql/table.h 2009-09-15 10:46:35 +0000
+++ b/sql/table.h 2010-06-29 00:10:53 +0000
@@ -887,6 +887,7 @@ enum enum_schema_tables
SCH_FILES,
SCH_GLOBAL_STATUS,
SCH_GLOBAL_VARIABLES,
+ SCH_KEY_CACHES,
SCH_KEY_COLUMN_USAGE,
SCH_OPEN_TABLES,
SCH_PARTITIONS,
=== modified file 'storage/myisam/mi_check.c'
--- a/storage/myisam/mi_check.c 2009-06-29 21:03:30 +0000
+++ b/storage/myisam/mi_check.c 2010-06-29 00:10:53 +0000
@@ -334,7 +334,8 @@ int chk_size(HA_CHECK *param, register M
/* The following is needed if called externally (not from myisamchk) */
flush_key_blocks(info->s->key_cache,
- info->s->kfile, FLUSH_FORCE_WRITE);
+ info->s->kfile, &info->s->dirty_part_map,
+ FLUSH_FORCE_WRITE);
size= my_seek(info->s->kfile, 0L, MY_SEEK_END, MYF(MY_THREADSAFE));
if ((skr=(my_off_t) info->state->key_file_length) != size)
@@ -1477,6 +1478,7 @@ static int mi_drop_all_indexes(HA_CHECK
*/
DBUG_PRINT("repair", ("all disabled are empty: create missing"));
error= flush_key_blocks(share->key_cache, share->kfile,
+ &share->dirty_part_map,
FLUSH_FORCE_WRITE);
goto end;
}
@@ -1491,6 +1493,7 @@ static int mi_drop_all_indexes(HA_CHECK
/* Remove all key blocks of this index file from key cache. */
if ((error= flush_key_blocks(share->key_cache, share->kfile,
+ &share->dirty_part_map,
FLUSH_IGNORE_CHANGED)))
goto end; /* purecov: inspected */
@@ -1550,7 +1553,7 @@ int mi_repair(HA_CHECK *param, register
if (!param->using_global_keycache)
VOID(init_key_cache(dflt_key_cache, param->key_cache_block_size,
- param->use_buffers, 0, 0));
+ (size_t) param->use_buffers, 0, 0, 0));
if (init_io_cache(¶m->read_cache,info->dfile,
(uint) param->read_buffer_length,
@@ -1763,7 +1766,8 @@ err:
VOID(end_io_cache(¶m->read_cache));
info->opt_flag&= ~(READ_CACHE_USED | WRITE_CACHE_USED);
VOID(end_io_cache(&info->rec_cache));
- got_error|=flush_blocks(param, share->key_cache, share->kfile);
+ got_error|=flush_blocks(param, share->key_cache, share->kfile,
+ &share->dirty_part_map);
if (!got_error && param->testflag & T_UNPACK)
{
share->state.header.options[0]&= (uchar) ~HA_OPTION_COMPRESS_RECORD;
@@ -1909,9 +1913,10 @@ void lock_memory(HA_CHECK *param __attri
/* Flush all changed blocks to disk */
-int flush_blocks(HA_CHECK *param, KEY_CACHE *key_cache, File file)
+int flush_blocks(HA_CHECK *param, KEY_CACHE *key_cache, File file,
+ ulonglong *dirty_part_map)
{
- if (flush_key_blocks(key_cache, file, FLUSH_RELEASE))
+ if (flush_key_blocks(key_cache, file, dirty_part_map, FLUSH_RELEASE))
{
mi_check_print_error(param,"%d when trying to write bufferts",my_errno);
return(1);
@@ -1978,7 +1983,8 @@ int mi_sort_index(HA_CHECK *param, regis
}
/* Flush key cache for this file if we are calling this outside myisamchk */
- flush_key_blocks(share->key_cache,share->kfile, FLUSH_IGNORE_CHANGED);
+ flush_key_blocks(share->key_cache, share->kfile, &share->dirty_part_map,
+ FLUSH_IGNORE_CHANGED);
share->state.version=(ulong) time((time_t*) 0);
old_state= share->state; /* save state if not stored */
@@ -2537,7 +2543,8 @@ int mi_repair_by_sort(HA_CHECK *param, r
memcpy( &share->state.state, info->state, sizeof(*info->state));
err:
- got_error|= flush_blocks(param, share->key_cache, share->kfile);
+ got_error|= flush_blocks(param, share->key_cache, share->kfile,
+ &share->dirty_part_map);
VOID(end_io_cache(&info->rec_cache));
if (!got_error)
{
@@ -3059,7 +3066,8 @@ int mi_repair_parallel(HA_CHECK *param,
memcpy(&share->state.state, info->state, sizeof(*info->state));
err:
- got_error|= flush_blocks(param, share->key_cache, share->kfile);
+ got_error|= flush_blocks(param, share->key_cache, share->kfile,
+ &share->dirty_part_map);
/*
Destroy the write cache. The master thread did already detach from
the share by remove_io_thread() or it was not yet started (if the
=== modified file 'storage/myisam/mi_close.c'
--- a/storage/myisam/mi_close.c 2009-09-07 20:50:10 +0000
+++ b/storage/myisam/mi_close.c 2010-06-29 00:10:53 +0000
@@ -64,6 +64,7 @@ int mi_close(register MI_INFO *info)
if (share->kfile >= 0) abort(););
if (share->kfile >= 0 &&
flush_key_blocks(share->key_cache, share->kfile,
+ &share->dirty_part_map,
share->temporary ? FLUSH_IGNORE_CHANGED :
FLUSH_RELEASE))
error=my_errno;
=== modified file 'storage/myisam/mi_delete_all.c'
--- a/storage/myisam/mi_delete_all.c 2008-04-28 16:24:05 +0000
+++ b/storage/myisam/mi_delete_all.c 2010-06-29 00:10:53 +0000
@@ -52,7 +52,8 @@ int mi_delete_all_rows(MI_INFO *info)
If we are using delayed keys or if the user has done changes to the tables
since it was locked then there may be key blocks in the key cache
*/
- flush_key_blocks(share->key_cache, share->kfile, FLUSH_IGNORE_CHANGED);
+ flush_key_blocks(share->key_cache, share->kfile, &share->dirty_part_map,
+ FLUSH_IGNORE_CHANGED);
#ifdef HAVE_MMAP
if (share->file_map)
_mi_unmap_file(info);
=== modified file 'storage/myisam/mi_extra.c'
--- a/storage/myisam/mi_extra.c 2009-10-06 06:13:56 +0000
+++ b/storage/myisam/mi_extra.c 2010-06-29 00:10:53 +0000
@@ -263,6 +263,7 @@ int mi_extra(MI_INFO *info, enum ha_extr
pthread_mutex_lock(&share->intern_lock);
/* Flush pages that we don't need anymore */
if (flush_key_blocks(share->key_cache, share->kfile,
+ &share->dirty_part_map,
(function == HA_EXTRA_PREPARE_FOR_DROP ?
FLUSH_IGNORE_CHANGED : FLUSH_RELEASE)))
{
@@ -321,7 +322,8 @@ int mi_extra(MI_INFO *info, enum ha_extr
break;
case HA_EXTRA_FLUSH:
if (!share->temporary)
- flush_key_blocks(share->key_cache, share->kfile, FLUSH_KEEP);
+ flush_key_blocks(share->key_cache, share->kfile, &share->dirty_part_map,
+ FLUSH_KEEP);
#ifdef HAVE_PWRITE
_mi_decrement_open_count(info);
#endif
=== modified file 'storage/myisam/mi_keycache.c'
--- a/storage/myisam/mi_keycache.c 2008-03-29 15:56:33 +0000
+++ b/storage/myisam/mi_keycache.c 2010-06-29 00:10:53 +0000
@@ -75,7 +75,8 @@ int mi_assign_to_key_cache(MI_INFO *info
in the old key cache.
*/
- if (flush_key_blocks(share->key_cache, share->kfile, FLUSH_RELEASE))
+ if (flush_key_blocks(share->key_cache, share->kfile, &share->dirty_part_map,
+ FLUSH_RELEASE))
{
error= my_errno;
mi_print_error(info->s, HA_ERR_CRASHED);
@@ -90,7 +91,8 @@ int mi_assign_to_key_cache(MI_INFO *info
(This can never fail as there is never any not written data in the
new key cache)
*/
- (void) flush_key_blocks(key_cache, share->kfile, FLUSH_RELEASE);
+ (void) flush_key_blocks(key_cache, share->kfile, &share->dirty_part_map,
+ FLUSH_RELEASE);
/*
ensure that setting the key cache and changing the multi_key_cache
@@ -102,6 +104,7 @@ int mi_assign_to_key_cache(MI_INFO *info
This should be seen at the lastes for the next call to an myisam function.
*/
share->key_cache= key_cache;
+ share->dirty_part_map= 0;
/* store the key cache in the global hash structure for future opens */
if (multi_key_cache_set((uchar*) share->unique_file_name,
=== modified file 'storage/myisam/mi_locking.c'
--- a/storage/myisam/mi_locking.c 2009-10-06 06:57:22 +0000
+++ b/storage/myisam/mi_locking.c 2010-06-29 00:10:53 +0000
@@ -68,7 +68,9 @@ int mi_lock_database(MI_INFO *info, int
--share->tot_locks;
if (info->lock_type == F_WRLCK && !share->w_locks &&
!share->delay_key_write && flush_key_blocks(share->key_cache,
- share->kfile,FLUSH_KEEP))
+ share->kfile,
+ &share->dirty_part_map,
+ FLUSH_KEEP))
{
error=my_errno;
mi_print_error(info->s, HA_ERR_CRASHED);
@@ -513,7 +515,8 @@ int _mi_test_if_changed(register MI_INFO
{ /* Keyfile has changed */
DBUG_PRINT("info",("index file changed"));
if (share->state.process != share->this_process)
- VOID(flush_key_blocks(share->key_cache, share->kfile, FLUSH_RELEASE));
+ VOID(flush_key_blocks(share->key_cache, share->kfile,
+ &share->dirty_part_map, FLUSH_RELEASE));
share->last_process=share->state.process;
info->last_unique= share->state.unique;
info->last_loop= share->state.update_count;
=== modified file 'storage/myisam/mi_page.c'
--- a/storage/myisam/mi_page.c 2009-05-06 12:03:24 +0000
+++ b/storage/myisam/mi_page.c 2010-06-29 00:10:53 +0000
@@ -94,10 +94,11 @@ int _mi_write_keypage(register MI_INFO *
}
#endif
DBUG_RETURN((key_cache_write(info->s->key_cache,
- info->s->kfile,page, level, (uchar*) buff,length,
- (uint) keyinfo->block_length,
- (int) ((info->lock_type != F_UNLCK) ||
- info->s->delay_key_write))));
+ info->s->kfile, &info->s->dirty_part_map,
+ page, level, (uchar*) buff, length,
+ (uint) keyinfo->block_length,
+ (int) ((info->lock_type != F_UNLCK) ||
+ info->s->delay_key_write))));
} /* mi_write_keypage */
@@ -116,7 +117,8 @@ int _mi_dispose(register MI_INFO *info,
mi_sizestore(buff,old_link);
info->s->state.changed|= STATE_NOT_SORTED_PAGES;
DBUG_RETURN(key_cache_write(info->s->key_cache,
- info->s->kfile, pos , level, buff,
+ info->s->kfile, &info->s->dirty_part_map,
+ pos , level, buff,
sizeof(buff),
(uint) keyinfo->block_length,
(int) (info->lock_type != F_UNLCK)));
=== modified file 'storage/myisam/mi_panic.c'
--- a/storage/myisam/mi_panic.c 2006-12-31 00:32:21 +0000
+++ b/storage/myisam/mi_panic.c 2010-06-29 00:10:53 +0000
@@ -47,7 +47,8 @@ int mi_panic(enum ha_panic_function flag
if (info->s->options & HA_OPTION_READ_ONLY_DATA)
break;
#endif
- if (flush_key_blocks(info->s->key_cache, info->s->kfile, FLUSH_RELEASE))
+ if (flush_key_blocks(info->s->key_cache, info->s->kfile,
+ &info->s->dirty_part_map, FLUSH_RELEASE))
error=my_errno;
if (info->opt_flag & WRITE_CACHE_USED)
if (flush_io_cache(&info->rec_cache))
=== modified file 'storage/myisam/mi_preload.c'
--- a/storage/myisam/mi_preload.c 2007-05-24 12:26:10 +0000
+++ b/storage/myisam/mi_preload.c 2010-06-29 00:10:53 +0000
@@ -65,7 +65,7 @@ int mi_preload(MI_INFO *info, ulonglong
}
}
else
- block_length= share->key_cache->key_cache_block_size;
+ block_length= share->key_cache->param_block_size;
length= info->preload_buff_size/block_length * block_length;
set_if_bigger(length, block_length);
@@ -73,7 +73,8 @@ int mi_preload(MI_INFO *info, ulonglong
if (!(buff= (uchar *) my_malloc(length, MYF(MY_WME))))
DBUG_RETURN(my_errno= HA_ERR_OUT_OF_MEM);
- if (flush_key_blocks(share->key_cache,share->kfile, FLUSH_RELEASE))
+ if (flush_key_blocks(share->key_cache, share->kfile, &share->dirty_part_map,
+ FLUSH_RELEASE))
goto err;
do
=== modified file 'storage/myisam/mi_test1.c'
--- a/storage/myisam/mi_test1.c 2008-04-28 16:24:05 +0000
+++ b/storage/myisam/mi_test1.c 2010-06-29 00:10:53 +0000
@@ -49,7 +49,8 @@ int main(int argc,char *argv[])
MY_INIT(argv[0]);
my_init();
if (key_cacheing)
- init_key_cache(dflt_key_cache,KEY_CACHE_BLOCK_SIZE,IO_SIZE*16,0,0);
+ init_key_cache(dflt_key_cache,KEY_CACHE_BLOCK_SIZE,IO_SIZE*16,0,0,
+ DEFAULT_KEY_CACHE_PARTITIONS);
get_options(argc,argv);
exit(run_test("test1"));
=== modified file 'storage/myisam/mi_test2.c'
--- a/storage/myisam/mi_test2.c 2008-04-28 16:24:05 +0000
+++ b/storage/myisam/mi_test2.c 2010-06-29 00:10:53 +0000
@@ -215,7 +215,8 @@ int main(int argc, char *argv[])
if (!silent)
printf("- Writing key:s\n");
if (key_cacheing)
- init_key_cache(dflt_key_cache,key_cache_block_size,key_cache_size,0,0);
+ init_key_cache(dflt_key_cache,key_cache_block_size,key_cache_size,0,0,
+ DEFAULT_KEY_CACHE_PARTITIONS);
if (do_locking)
mi_lock_database(file,F_WRLCK);
if (write_cacheing)
=== modified file 'storage/myisam/mi_test3.c'
--- a/storage/myisam/mi_test3.c 2008-04-28 16:24:05 +0000
+++ b/storage/myisam/mi_test3.c 2010-06-29 00:10:53 +0000
@@ -177,7 +177,8 @@ void start_test(int id)
exit(1);
}
if (key_cacheing && rnd(2) == 0)
- init_key_cache(dflt_key_cache, KEY_CACHE_BLOCK_SIZE, 65536L, 0, 0);
+ init_key_cache(dflt_key_cache, KEY_CACHE_BLOCK_SIZE, 65536L, 0, 0,
+ DEFAULT_KEY_CACHE_PARTITIONS);
printf("Process %d, pid: %d\n",id,getpid()); fflush(stdout);
for (error=i=0 ; i < tests && !error; i++)
=== modified file 'storage/myisam/myisam_ftdump.c'
--- a/storage/myisam/myisam_ftdump.c 2007-05-10 09:59:39 +0000
+++ b/storage/myisam/myisam_ftdump.c 2010-06-29 00:10:53 +0000
@@ -83,7 +83,7 @@ int main(int argc,char *argv[])
usage();
}
- init_key_cache(dflt_key_cache,MI_KEY_BLOCK_LENGTH,USE_BUFFER_INIT, 0, 0);
+ init_key_cache(dflt_key_cache,MI_KEY_BLOCK_LENGTH,USE_BUFFER_INIT, 0, 0, 0);
if (!(info=mi_open(argv[0], O_RDONLY,
HA_OPEN_ABORT_IF_LOCKED|HA_OPEN_FROM_SQL_LAYER)))
=== modified file 'storage/myisam/myisamchk.c'
--- a/storage/myisam/myisamchk.c 2009-09-19 21:21:29 +0000
+++ b/storage/myisam/myisamchk.c 2010-06-29 00:10:53 +0000
@@ -1102,7 +1102,7 @@ static int myisamchk(HA_CHECK *param, ch
{
if (param->testflag & (T_EXTEND | T_MEDIUM))
VOID(init_key_cache(dflt_key_cache,opt_key_cache_block_size,
- (size_t) param->use_buffers, 0, 0));
+ (size_t) param->use_buffers, 0, 0, 0));
VOID(init_io_cache(¶m->read_cache,datafile,
(uint) param->read_buffer_length,
READ_CACHE,
@@ -1116,7 +1116,8 @@ static int myisamchk(HA_CHECK *param, ch
HA_OPTION_COMPRESS_RECORD)) ||
(param->testflag & (T_EXTEND | T_MEDIUM)))
error|=chk_data_link(param, info, test(param->testflag & T_EXTEND));
- error|=flush_blocks(param, share->key_cache, share->kfile);
+ error|=flush_blocks(param, share->key_cache, share->kfile,
+ &share->dirty_part_map);
VOID(end_io_cache(¶m->read_cache));
}
if (!error)
@@ -1526,7 +1527,7 @@ static int mi_sort_records(HA_CHECK *par
DBUG_RETURN(0); /* Nothing to do */
init_key_cache(dflt_key_cache, opt_key_cache_block_size,
- (size_t) param->use_buffers, 0, 0);
+ (size_t) param->use_buffers, 0, 0, 0);
if (init_io_cache(&info->rec_cache,-1,(uint) param->write_buffer_length,
WRITE_CACHE,share->pack.header_length,1,
MYF(MY_WME | MY_WAIT_IF_FULL)))
@@ -1641,8 +1642,8 @@ err:
my_free(sort_info.buff,MYF(MY_ALLOW_ZERO_PTR));
sort_info.buff=0;
share->state.sortkey=sort_key;
- DBUG_RETURN(flush_blocks(param, share->key_cache, share->kfile) |
- got_error);
+ DBUG_RETURN(flush_blocks(param, share->key_cache, share->kfile,
+ &share->dirty_part_map) | got_error);
} /* sort_records */
=== modified file 'storage/myisam/myisamdef.h'
--- a/storage/myisam/myisamdef.h 2009-10-06 06:57:22 +0000
+++ b/storage/myisam/myisamdef.h 2010-06-29 00:10:53 +0000
@@ -174,6 +174,8 @@ typedef struct st_mi_isam_share
*index_file_name;
uchar *file_map; /* mem-map of file if possible */
KEY_CACHE *key_cache; /* ref to the current key cache */
+ /* To mark the key cache partitions containing dirty pages for this file */
+ ulonglong dirty_part_map;
MI_DECODE_TREE *decode_trees;
uint16 *decode_tables;
/* Function to use for a row checksum. */
@@ -732,7 +734,8 @@ void mi_check_print_info _VARARGS((HA_CH
#ifdef THREAD
pthread_handler_t thr_find_all_keys(void *arg);
#endif
-int flush_blocks(HA_CHECK *param, KEY_CACHE *key_cache, File file);
+int flush_blocks(HA_CHECK *param, KEY_CACHE *key_cache, File file,
+ ulonglong *dirty_part_map);
#ifdef __cplusplus
}
#endif
=== modified file 'storage/myisam/myisamlog.c'
--- a/storage/myisam/myisamlog.c 2008-02-18 22:35:17 +0000
+++ b/storage/myisam/myisamlog.c 2010-06-29 00:10:53 +0000
@@ -333,7 +333,7 @@ static int examine_log(char * file_name,
init_tree(&tree,0,0,sizeof(file_info),(qsort_cmp2) file_info_compare,1,
(tree_element_free) file_info_free, NULL);
VOID(init_key_cache(dflt_key_cache,KEY_CACHE_BLOCK_SIZE,KEY_CACHE_SIZE,
- 0, 0));
+ 0, 0, 0));
files_open=0; access_time=0;
while (access_time++ != number_of_commands &&
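As an aside, a minimal standalone sketch (not part of the patch, all names
invented) of the kind of bookkeeping a per-file bitmap like dirty_part_map
enables: set a partition's bit when a dirty page for the file lands in that
key cache partition, and let the flush visit only the marked partitions.
#include <cstdint>
#include <cstdio>
static uint64_t dirty_part_map= 0;          /* one bit per key cache partition */
static void mark_partition_dirty(unsigned part)
{
  dirty_part_map|= 1ULL << part;            /* dirty page landed in 'part' */
}
static void flush_file_blocks(unsigned n_partitions)
{
  for (unsigned i= 0; i < n_partitions; i++)
    if (dirty_part_map & (1ULL << i))
      printf("flush partition %u\n", i);    /* stand-in for the real flush */
  dirty_part_map= 0;                        /* nothing dirty for this file now */
}
int main()
{
  mark_partition_dirty(2);
  mark_partition_dirty(5);
  flush_file_blocks(8);                     /* visits only partitions 2 and 5 */
  return 0;
}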
1
0
[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (igor:2750)
by Igor Babaev 29 Jun '10
by Igor Babaev 29 Jun '10
29 Jun '10
#At lp:maria based on revid:igor@askmonty.org-20091103182103-jnjuss2b4t72rz83
2750 Igor Babaev 2010-06-28
An implementation of index intersect via a modified Unique class.
This code is planned to be used for mwl#21.
modified:
include/my_tree.h
mysys/tree.c
sql/filesort.cc
sql/opt_range.cc
sql/opt_range.h
sql/sql_class.h
sql/sql_select.cc
sql/sql_sort.h
sql/uniques.cc
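Before the diffs, a rough standalone sketch (not part of the patch, heavily
simplified) of the idea the modified Unique class implements: every merged
index scan feeds its rowids into a container that counts duplicates, and a
rowid belongs to the intersection only if its count reaches the number of
merged scans (the min_dupl_count introduced below). Once the first scan has
finished, a rowid that is not yet in the container can never reach that count,
which is what closing the tree for expansion (TREE_ONLY_DUPS) exploits.
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>
typedef uint64_t Rowid;
/* Keep only rowids returned by every scan, i.e. rowids whose duplicate count
   reaches scans.size(); this is the role min_dupl_count plays in the patch.
   Assumes each scan returns a given rowid at most once. */
static std::vector<Rowid>
intersect_scans(const std::vector<std::vector<Rowid> > &scans)
{
  std::map<Rowid, unsigned> counts;                /* Unique-like container */
  for (size_t i= 0; i < scans.size(); i++)
    for (size_t j= 0; j < scans[i].size(); j++)
      counts[scans[i][j]]++;                       /* unique_add() analogue */
  std::vector<Rowid> result;
  for (std::map<Rowid, unsigned>::const_iterator it= counts.begin();
       it != counts.end(); ++it)
    if (it->second >= scans.size())                /* count >= min_dupl_count */
      result.push_back(it->first);
  return result;
}
int main()
{
  std::vector<std::vector<Rowid> > scans(2);
  const Rowid a[]= {1, 3, 5, 7};
  const Rowid b[]= {3, 4, 5, 9};
  scans[0].assign(a, a + 4);
  scans[1].assign(b, b + 4);
  std::vector<Rowid> r= intersect_scans(scans);    /* yields 3 and 5 */
  for (size_t i= 0; i < r.size(); i++)
    printf("%llu\n", (unsigned long long) r[i]);
  return 0;
}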
=== modified file 'include/my_tree.h'
--- a/include/my_tree.h 2008-05-29 15:33:33 +0000
+++ b/include/my_tree.h 2010-06-29 00:02:19 +0000
@@ -31,6 +31,7 @@ extern "C" {
#define tree_set_pointer(element,ptr) *((uchar **) (element+1))=((uchar*) (ptr))
#define TREE_NO_DUPS 1
+#define TREE_ONLY_DUPS 2
typedef enum { left_root_right, right_root_left } TREE_WALK;
typedef uint32 element_count;
=== modified file 'mysys/tree.c'
--- a/mysys/tree.c 2007-05-10 09:59:39 +0000
+++ b/mysys/tree.c 2010-06-29 00:02:19 +0000
@@ -221,6 +221,8 @@ TREE_ELEMENT *tree_insert(TREE *tree, vo
}
if (element == &tree->null_element)
{
+ if (tree->flag & TREE_ONLY_DUPS)
+ return((TREE_ELEMENT *) 1);
uint alloc_size=sizeof(TREE_ELEMENT)+key_size+tree->size_of_element;
tree->allocated+=alloc_size;
=== modified file 'sql/filesort.cc'
--- a/sql/filesort.cc 2009-09-03 14:05:38 +0000
+++ b/sql/filesort.cc 2010-06-29 00:02:19 +0000
@@ -50,10 +50,6 @@ static int write_keys(SORTPARAM *param,u
uint count, IO_CACHE *buffer_file, IO_CACHE *tempfile);
static void make_sortkey(SORTPARAM *param,uchar *to, uchar *ref_pos);
static void register_used_fields(SORTPARAM *param);
-static int merge_index(SORTPARAM *param,uchar *sort_buffer,
- BUFFPEK *buffpek,
- uint maxbuffer,IO_CACHE *tempfile,
- IO_CACHE *outfile);
static bool save_index(SORTPARAM *param,uchar **sort_keys, uint count,
FILESORT_INFO *table_sort);
static uint suffix_length(ulong string_length);
@@ -143,6 +139,7 @@ ha_rows filesort(THD *thd, TABLE *table,
bzero((char*) ¶m,sizeof(param));
param.sort_length= sortlength(thd, sortorder, s_length, &multi_byte_charset);
param.ref_length= table->file->ref_length;
+ param.min_dupl_count= 0;
param.addon_field= 0;
param.addon_length= 0;
if (!(table->file->ha_table_flags() & HA_FAST_KEY_READ) &&
@@ -1212,7 +1209,13 @@ int merge_buffers(SORTPARAM *param, IO_C
rec_length= param->rec_length;
res_length= param->res_length;
sort_length= param->sort_length;
- offset= rec_length-res_length;
+ element_count dupl_count;
+ uchar *src;
+ uint dupl_count_ofs= rec_length-sizeof(element_count);
+ uint min_dupl_count= param->min_dupl_count;
+ offset= rec_length-
+ (flag && min_dupl_count ? sizeof(dupl_count) : 0)-res_length;
+ uint wr_len= flag ? res_length : rec_length;
maxcount= (ulong) (param->keys/((uint) (Tb-Fb) +1));
to_start_filepos= my_b_tell(to_file);
strpos= sort_buffer;
@@ -1258,16 +1261,20 @@ int merge_buffers(SORTPARAM *param, IO_C
*/
buffpek= (BUFFPEK*) queue_top(&queue);
memcpy(param->unique_buff, buffpek->key, rec_length);
- if (my_b_write(to_file, (uchar*) buffpek->key, rec_length))
- {
- error=1; goto err; /* purecov: inspected */
- }
+ if (min_dupl_count)
+ memcpy(&dupl_count, param->unique_buff+dupl_count_ofs,
+ sizeof(dupl_count));
buffpek->key+= rec_length;
- buffpek->mem_count--;
- if (!--max_rows)
+ if (! --buffpek->mem_count)
{
- error= 0; /* purecov: inspected */
- goto end; /* purecov: inspected */
+ if (!(error= (int) read_to_buffer(from_file,buffpek,
+ rec_length)))
+ {
+ VOID(queue_remove(&queue,0));
+ reuse_freed_buff(&queue, buffpek, rec_length);
+ }
+ else if (error == -1)
+ goto err; /* purecov: inspected */
}
queue_replaced(&queue); // Top element has been used
}
@@ -1283,27 +1290,42 @@ int merge_buffers(SORTPARAM *param, IO_C
for (;;)
{
buffpek= (BUFFPEK*) queue_top(&queue);
+ src= buffpek->key;
if (cmp) // Remove duplicates
{
if (!(*cmp)(first_cmp_arg, &(param->unique_buff),
(uchar**) &buffpek->key))
- goto skip_duplicate;
- memcpy(param->unique_buff, (uchar*) buffpek->key, rec_length);
- }
- if (flag == 0)
- {
- if (my_b_write(to_file,(uchar*) buffpek->key, rec_length))
- {
- error=1; goto err; /* purecov: inspected */
+ {
+ if (min_dupl_count)
+ {
+ element_count cnt;
+ memcpy(&cnt, (uchar *) buffpek->key+dupl_count_ofs, sizeof(cnt));
+ dupl_count+= cnt;
+ }
+ goto skip_duplicate;
}
+ if (min_dupl_count)
+ {
+ memcpy(param->unique_buff+dupl_count_ofs, &dupl_count,
+ sizeof(dupl_count));
+ }
+ src= param->unique_buff;
}
- else
+
+ if (!flag || !min_dupl_count || dupl_count >= min_dupl_count)
{
- if (my_b_write(to_file, (uchar*) buffpek->key+offset, res_length))
+ if (my_b_write(to_file, src+(flag ? offset : 0), wr_len))
{
error=1; goto err; /* purecov: inspected */
}
}
+ if (cmp)
+ {
+ memcpy(param->unique_buff, (uchar*) buffpek->key, rec_length);
+ if (min_dupl_count)
+ memcpy(&dupl_count, param->unique_buff+dupl_count_ofs,
+ sizeof(dupl_count));
+ }
if (!--max_rows)
{
error= 0; /* purecov: inspected */
@@ -1339,9 +1361,33 @@ int merge_buffers(SORTPARAM *param, IO_C
{
if (!(*cmp)(first_cmp_arg, &(param->unique_buff), (uchar**) &buffpek->key))
{
- buffpek->key+= rec_length; // Remove duplicate
+ if (min_dupl_count)
+ {
+ element_count cnt;
+ memcpy(&cnt, (uchar *) buffpek->key+dupl_count_ofs, sizeof(cnt));
+ dupl_count+= cnt;
+ }
+ buffpek->key+= rec_length;
--buffpek->mem_count;
}
+
+ if (min_dupl_count)
+ memcpy(param->unique_buff+dupl_count_ofs, &dupl_count,
+ sizeof(dupl_count));
+
+ if (!flag || !min_dupl_count || dupl_count >= min_dupl_count)
+ {
+ src= param->unique_buff;
+ if (my_b_write(to_file, src+(flag ? offset : 0), wr_len))
+ {
+ error=1; goto err; /* purecov: inspected */
+ }
+ if (!--max_rows)
+ {
+ error= 0;
+ goto end;
+ }
+ }
}
do
@@ -1363,12 +1409,17 @@ int merge_buffers(SORTPARAM *param, IO_C
else
{
register uchar *end;
- strpos= buffpek->key+offset;
- for (end= strpos+buffpek->mem_count*rec_length ;
- strpos != end ;
- strpos+= rec_length)
- {
- if (my_b_write(to_file, strpos, res_length))
+ src= buffpek->key+offset;
+ for (end= src+buffpek->mem_count*rec_length ;
+ src != end ;
+ src+= rec_length)
+ {
+ if (flag && min_dupl_count &&
+ memcmp(&min_dupl_count, src+dupl_count_ofs,
+ sizeof(dupl_count_ofs))<0)
+ continue;
+
+ if (my_b_write(to_file, src, wr_len))
{
error=1; goto err;
}
@@ -1389,7 +1440,7 @@ err:
/* Do a merge to output-file (save only positions) */
-static int merge_index(SORTPARAM *param, uchar *sort_buffer,
+int merge_index(SORTPARAM *param, uchar *sort_buffer,
BUFFPEK *buffpek, uint maxbuffer,
IO_CACHE *tempfile, IO_CACHE *outfile)
{
=== modified file 'sql/opt_range.cc'
--- a/sql/opt_range.cc 2009-10-30 00:36:35 +0000
+++ b/sql/opt_range.cc 2010-06-29 00:02:19 +0000
@@ -697,6 +697,9 @@ public:
key_map ror_scans_map; /* bitmask of ROR scan-able elements in keys */
uint n_ror_scans; /* number of set bits in ror_scans_map */
+ struct st_index_scan_info **index_scans; /* list of index scans */
+ struct st_index_scan_info **index_scans_end; /* last index scan */
+
struct st_ror_scan_info **ror_scans; /* list of ROR key scans */
struct st_ror_scan_info **ror_scans_end; /* last ROR scan */
/* Note that #records for each key scan is stored in table->quick_rows */
@@ -776,9 +779,11 @@ class TABLE_READ_PLAN;
class TRP_RANGE;
class TRP_ROR_INTERSECT;
class TRP_ROR_UNION;
+ class TRP_INDEX_INTERSECT;
class TRP_INDEX_MERGE;
class TRP_GROUP_MIN_MAX;
+struct st_index_scan_info;
struct st_ror_scan_info;
static SEL_TREE * get_mm_parts(RANGE_OPT_PARAM *param,COND *cond_func,Field *field,
@@ -804,6 +809,9 @@ static TRP_RANGE *get_key_scans_params(P
bool update_tbl_stats,
double read_time);
static
+TRP_INDEX_INTERSECT *get_best_index_intersect(PARAM *param, SEL_TREE *tree,
+ double read_time);
+static
TRP_ROR_INTERSECT *get_best_ror_intersect(const PARAM *param, SEL_TREE *tree,
double read_time,
bool *are_all_covering);
@@ -1743,7 +1751,7 @@ int QUICK_INDEX_MERGE_SELECT::init()
int QUICK_INDEX_MERGE_SELECT::reset()
{
DBUG_ENTER("QUICK_INDEX_MERGE_SELECT::reset");
- DBUG_RETURN(read_keys_and_merge());
+ DBUG_RETURN (read_keys_and_merge());
}
bool
@@ -1778,6 +1786,63 @@ QUICK_INDEX_MERGE_SELECT::~QUICK_INDEX_M
DBUG_VOID_RETURN;
}
+QUICK_INDEX_INTERSECT_SELECT::QUICK_INDEX_INTERSECT_SELECT(THD *thd_param,
+ TABLE *table)
+ :pk_quick_select(NULL), thd(thd_param)
+{
+ DBUG_ENTER("QUICK_INDEX_INTERSECT_SELECT::QUICK_INDEX_INTERSECT_SELECT");
+ index= MAX_KEY;
+ head= table;
+ bzero(&read_record, sizeof(read_record));
+ init_sql_alloc(&alloc, thd->variables.range_alloc_block_size, 0);
+ DBUG_VOID_RETURN;
+}
+
+int QUICK_INDEX_INTERSECT_SELECT::init()
+{
+ DBUG_ENTER("QUICK_INDEX_INTERSECT_SELECT::init");
+ DBUG_RETURN(0);
+}
+
+int QUICK_INDEX_INTERSECT_SELECT::reset()
+{
+ DBUG_ENTER("QUICK_INDEX_INTERSECT_SELECT::reset");
+ DBUG_RETURN (read_keys_and_merge());
+}
+
+bool
+QUICK_INDEX_INTERSECT_SELECT::push_quick_back(QUICK_RANGE_SELECT *quick_sel_range)
+{
+ /*
+ Save quick_select that does scan on clustered primary key as it will be
+ processed separately.
+ */
+ if (head->file->primary_key_is_clustered() &&
+ quick_sel_range->index == head->s->primary_key)
+ pk_quick_select= quick_sel_range;
+ else
+ return quick_selects.push_back(quick_sel_range);
+ return 0;
+}
+
+QUICK_INDEX_INTERSECT_SELECT::~QUICK_INDEX_INTERSECT_SELECT()
+{
+ List_iterator_fast<QUICK_RANGE_SELECT> quick_it(quick_selects);
+ QUICK_RANGE_SELECT* quick;
+ DBUG_ENTER("QUICK_INDEX_INTERSECT_SELECT::~QUICK_INDEX_INTERSECT_SELECT");
+ quick_it.rewind();
+ while ((quick= quick_it++))
+ quick->file= NULL;
+ quick_selects.delete_elements();
+ delete pk_quick_select;
+ /* It's ok to call the next two even if they are already deinitialized */
+ end_read_record(&read_record);
+ free_io_cache(head);
+ free_root(&alloc,MYF(0));
+ DBUG_VOID_RETURN;
+}
+
+
QUICK_ROR_INTERSECT_SELECT::QUICK_ROR_INTERSECT_SELECT(THD *thd_param,
TABLE *table,
@@ -2555,6 +2620,24 @@ public:
/*
+ Plan for QUICK_INDEX_INTERSECT_SELECT scan.
+ QUICK_INDEX_INTERSECT_SELECT always retrieves full rows, so retrieve_full_rows
+ is ignored by make_quick.
+*/
+
+class TRP_INDEX_INTERSECT : public TABLE_READ_PLAN
+{
+public:
+ TRP_INDEX_INTERSECT() {} /* Remove gcc warning */
+ virtual ~TRP_INDEX_INTERSECT() {} /* Remove gcc warning */
+ QUICK_SELECT_I *make_quick(PARAM *param, bool retrieve_full_rows,
+ MEM_ROOT *parent_alloc);
+ TRP_RANGE **range_scans; /* array of ptrs to plans of merged scans */
+ TRP_RANGE **range_scans_end; /* end of the array */
+};
+
+
+/*
Plan for QUICK_INDEX_MERGE_SELECT scan.
QUICK_ROR_INTERSECT_SELECT always retrieves full rows, so retrieve_full_rows
is ignored by make_quick.
@@ -2621,6 +2704,30 @@ public:
};
+typedef struct st_index_scan_info
+{
+ uint idx; /* # of used key in param->keys */
+ uint keynr; /* # of used key in table */
+ uint range_count;
+ ha_rows records; /* estimate of # records this scan will return */
+
+ /* Set of intervals over key fields that will be used for row retrieval. */
+ SEL_ARG *sel_arg;
+
+ /* Fields used in the query and covered by this ROR scan. */
+ MY_BITMAP covered_fields;
+ uint used_fields_covered; /* # of set bits in covered_fields */
+ int key_rec_length; /* length of key record (including rowid) */
+
+ /*
+ Cost of reading all index records with values in sel_arg intervals set
+ (assuming there is no need to access full table records)
+ */
+ double index_read_cost;
+ uint first_uncovered_field; /* first unused bit in covered_fields */
+ uint key_components; /* # of parts in the key */
+} INDEX_SCAN_INFO;
+
/*
Fill param->needed_fields with bitmap of fields used in the query.
SYNOPSIS
@@ -2899,6 +3006,7 @@ int SQL_SELECT::test_quick_select(THD *t
*/
TRP_RANGE *range_trp;
TRP_ROR_INTERSECT *rori_trp;
+ TRP_INDEX_INTERSECT *intersect_trp;
bool can_build_covering= FALSE;
remove_nonrange_trees(¶m, tree);
@@ -2938,6 +3046,18 @@ int SQL_SELECT::test_quick_select(THD *t
best_trp= rori_trp;
}
}
+#if 1
+#else
+ if (optimizer_flag(thd, OPTIMIZER_SWITCH_INDEX_MERGE))
+ {
+ if ((intersect_trp= get_best_index_intersect(&param, tree,
+ best_read_time)))
+ {
+ best_trp= intersect_trp;
+ best_read_time= best_trp->read_cost;
+ }
+ }
+#endif
if (optimizer_flag(thd, OPTIMIZER_SWITCH_INDEX_MERGE))
{
@@ -4601,6 +4721,85 @@ TABLE_READ_PLAN *merge_same_index_scans(
DBUG_RETURN(trp);
}
+static
+TRP_INDEX_INTERSECT *get_best_index_intersect(PARAM *param, SEL_TREE *tree,
+ double read_time)
+{
+ uint i;
+ uint unique_calc_buff_size;
+ TRP_RANGE **cur_range;
+ TRP_RANGE **range_scans;
+ TRP_INDEX_INTERSECT *intersect_trp= NULL;
+ double intersect_cost= 0.0;
+ ha_rows scan_records= 0;
+ double selectivity= 1.0;
+ ha_rows table_records= param->table->file->stats.records;
+ uint n_index_scans= tree->index_scans_end - tree->index_scans;
+
+ DBUG_ENTER("get_best_index_intersect");
+
+ if (!n_index_scans)
+ DBUG_RETURN(NULL);
+
+ if (!(range_scans= (TRP_RANGE**)alloc_root(param->mem_root,
+ sizeof(TRP_RANGE *)*
+ n_index_scans)))
+ DBUG_RETURN(NULL);
+
+ for (i= 0, cur_range= range_scans; i < n_index_scans; i++)
+ {
+ struct st_index_scan_info *index_scan= tree->index_scans[i];
+ if ((*cur_range= new (param->mem_root) TRP_RANGE(index_scan->sel_arg,
+ index_scan->idx)))
+ {
+ TRP_RANGE *trp= *cur_range;
+ trp->records= index_scan->records;
+ trp->is_ror= FALSE;
+ trp->read_cost= get_index_only_read_time(param, index_scan->records,
+ index_scan->keynr);
+ scan_records+= trp->records;
+ selectivity*= (double) trp->records/table_records;
+ intersect_cost+= trp->read_cost;
+ cur_range++;
+ }
+ }
+
+ /* Add Unique operations cost */
+ unique_calc_buff_size=
+ Unique::get_cost_calc_buff_size((ulong)scan_records,
+ param->table->file->ref_length,
+ param->thd->variables.sortbuff_size);
+ if (param->imerge_cost_buff_size < unique_calc_buff_size)
+ {
+ if (!(param->imerge_cost_buff= (uint*)alloc_root(param->mem_root,
+ unique_calc_buff_size)))
+ DBUG_RETURN(NULL);
+ param->imerge_cost_buff_size= unique_calc_buff_size;
+ }
+
+ intersect_cost +=
+ Unique::get_use_cost(param->imerge_cost_buff, scan_records,
+ param->table->file->ref_length,
+ param->thd->variables.sortbuff_size);
+
+ intersect_cost += get_sweep_read_cost(param,
+ (ha_rows) (table_records*selectivity));
+
+ if (intersect_cost < read_time)
+ {
+ if ((intersect_trp= new (param->mem_root)TRP_INDEX_INTERSECT))
+ {
+ intersect_trp->read_cost= intersect_cost;
+ intersect_trp->records= (ha_rows) table_records*selectivity;
+ set_if_bigger(intersect_trp->records, 1);
+ intersect_trp->range_scans= range_scans;
+ intersect_trp->range_scans_end= cur_range;
+ read_time= intersect_cost;
+ }
+ }
+ DBUG_RETURN(intersect_trp);
+}
+
/*
Calculate cost of 'index only' scan for given index and number of records.
@@ -4638,27 +4837,8 @@ static double get_index_only_read_time(c
}
-typedef struct st_ror_scan_info
-{
- uint idx; /* # of used key in param->keys */
- uint keynr; /* # of used key in table */
- ha_rows records; /* estimate of # records this scan will return */
-
- /* Set of intervals over key fields that will be used for row retrieval. */
- SEL_ARG *sel_arg;
-
- /* Fields used in the query and covered by this ROR scan. */
- MY_BITMAP covered_fields;
- uint used_fields_covered; /* # of set bits in covered_fields */
- int key_rec_length; /* length of key record (including rowid) */
-
- /*
- Cost of reading all index records with values in sel_arg intervals set
- (assuming there is no need to access full table records)
- */
- double index_read_cost;
- uint first_uncovered_field; /* first unused bit in covered_fields */
- uint key_components; /* # of parts in the key */
+typedef struct st_ror_scan_info : INDEX_SCAN_INFO
+{
} ROR_SCAN_INFO;
@@ -5518,6 +5698,14 @@ static TRP_RANGE *get_key_scans_params(P
"tree scans"););
tree->ror_scans_map.clear_all();
tree->n_ror_scans= 0;
+ tree->index_scans= 0;
+ if (!tree->keys_map.is_clear_all())
+ {
+ tree->index_scans=
+ (INDEX_SCAN_INFO **) alloc_root(param->mem_root,
+ sizeof(INDEX_SCAN_INFO *) * param->keys);
+ }
+ tree->index_scans_end= tree->index_scans;
for (idx= 0,key=tree->keys, end=key+param->keys;
key != end ;
key++,idx++)
@@ -5526,6 +5714,7 @@ static TRP_RANGE *get_key_scans_params(P
double found_read_time;
if (*key)
{
+ INDEX_SCAN_INFO *index_scan;
uint keynr= param->real_keynr[idx];
if ((*key)->type == SEL_ARG::MAYBE_KEY ||
(*key)->maybe_flag)
@@ -5535,6 +5724,17 @@ static TRP_RANGE *get_key_scans_params(P
(bool) param->table->covering_keys.is_set(keynr);
found_records= check_quick_select(param, idx, *key, update_tbl_stats);
+ if (found_records != HA_POS_ERROR && tree->index_scans &&
+ (index_scan= (INDEX_SCAN_INFO *)alloc_root(param->mem_root,
+ sizeof(INDEX_SCAN_INFO))))
+ {
+ index_scan->idx= idx;
+ index_scan->keynr= keynr;
+ index_scan->range_count= param->range_count;
+ index_scan->records= found_records;
+ index_scan->sel_arg= *key;
+ *tree->index_scans_end++= index_scan;
+ }
if (param->is_ror_scan)
{
tree->n_ror_scans++;
@@ -5629,6 +5829,34 @@ QUICK_SELECT_I *TRP_INDEX_MERGE::make_qu
return quick_imerge;
}
+QUICK_SELECT_I *TRP_INDEX_INTERSECT::make_quick(PARAM *param,
+ bool retrieve_full_rows,
+ MEM_ROOT *parent_alloc)
+{
+ QUICK_INDEX_INTERSECT_SELECT *quick_intersect;
+ QUICK_RANGE_SELECT *quick;
+ /* index_merge always retrieves full rows, ignore retrieve_full_rows */
+ if (!(quick_intersect= new QUICK_INDEX_INTERSECT_SELECT(param->thd, param->table)))
+ return NULL;
+
+ quick_intersect->records= records;
+ quick_intersect->read_time= read_cost;
+ for (TRP_RANGE **range_scan= range_scans; range_scan != range_scans_end;
+ range_scan++)
+ {
+ if (!(quick= (QUICK_RANGE_SELECT*)
+ ((*range_scan)->make_quick(param, FALSE, &quick_intersect->alloc)))||
+ quick_intersect->push_quick_back(quick))
+ {
+ delete quick;
+ delete quick_intersect;
+ return NULL;
+ }
+ }
+ return quick_intersect;
+}
+
+
QUICK_SELECT_I *TRP_ROR_INTERSECT::make_quick(PARAM *param,
bool retrieve_full_rows,
MEM_ROOT *parent_alloc)
@@ -8893,6 +9121,18 @@ bool QUICK_INDEX_MERGE_SELECT::is_keys_u
return 0;
}
+bool QUICK_INDEX_INTERSECT_SELECT::is_keys_used(const MY_BITMAP *fields)
+{
+ QUICK_RANGE_SELECT *quick;
+ List_iterator_fast<QUICK_RANGE_SELECT> it(quick_selects);
+ while ((quick= it++))
+ {
+ if (is_key_used(head, quick->index, fields))
+ return 1;
+ }
+ return 0;
+}
+
bool QUICK_ROR_INTERSECT_SELECT::is_keys_used(const MY_BITMAP *fields)
{
QUICK_RANGE_SELECT *quick;
@@ -9038,14 +9278,19 @@ err:
other error
*/
-int QUICK_INDEX_MERGE_SELECT::read_keys_and_merge()
+int read_keys_and_merge_scans(THD *thd,
+ TABLE *head,
+ List<QUICK_RANGE_SELECT> quick_selects,
+ QUICK_RANGE_SELECT *pk_quick_select,
+ READ_RECORD *read_record,
+ bool intersection)
{
List_iterator_fast<QUICK_RANGE_SELECT> cur_quick_it(quick_selects);
QUICK_RANGE_SELECT* cur_quick;
int result;
Unique *unique;
handler *file= head->file;
- DBUG_ENTER("QUICK_INDEX_MERGE_SELECT::read_keys_and_merge");
+ DBUG_ENTER("read_keys_and_merge");
/* We're going to just read rowids. */
file->extra(HA_EXTRA_KEYREAD);
@@ -9053,6 +9298,7 @@ int QUICK_INDEX_MERGE_SELECT::read_keys_
cur_quick_it.rewind();
cur_quick= cur_quick_it++;
+ bool first_quick= TRUE;
DBUG_ASSERT(cur_quick != 0);
/*
@@ -9064,13 +9310,20 @@ int QUICK_INDEX_MERGE_SELECT::read_keys_
unique= new Unique(refpos_order_cmp, (void *)file,
file->ref_length,
- thd->variables.sortbuff_size);
+ thd->variables.sortbuff_size,
+ intersection ? quick_selects.elements : 0);
if (!unique)
DBUG_RETURN(1);
for (;;)
{
while ((result= cur_quick->get_next()) == HA_ERR_END_OF_FILE)
{
+ if (first_quick)
+ {
+ first_quick= FALSE;
+ if (intersection && unique->is_in_memory())
+ unique->close_for_expansion();
+ }
cur_quick->range_end();
cur_quick= cur_quick_it++;
if (!cur_quick)
@@ -9113,14 +9366,24 @@ int QUICK_INDEX_MERGE_SELECT::read_keys_
*/
result= unique->get(head);
delete unique;
- doing_pk_scan= FALSE;
/* index_merge currently doesn't support "using index" at all */
file->extra(HA_EXTRA_NO_KEYREAD);
- init_read_record(&read_record, thd, head, (SQL_SELECT*) 0, 1 , 1, TRUE);
+ init_read_record(read_record, thd, head, (SQL_SELECT*) 0, 1 , 1, TRUE);
DBUG_RETURN(result);
}
+int QUICK_INDEX_MERGE_SELECT::read_keys_and_merge()
+
+{
+ int result;
+ DBUG_ENTER("QUICK_INDEX_MERGE_SELECT::read_keys_and_merge");
+ result= read_keys_and_merge_scans(thd, head, quick_selects, pk_quick_select,
+ &read_record, FALSE);
+ doing_pk_scan= FALSE;
+ DBUG_RETURN(result);
+}
+
/*
Get next row for index_merge.
NOTES
@@ -9157,6 +9420,44 @@ int QUICK_INDEX_MERGE_SELECT::get_next()
DBUG_RETURN(result);
}
+int QUICK_INDEX_INTERSECT_SELECT::read_keys_and_merge()
+
+{
+ int result;
+ DBUG_ENTER("QUICK_INDEX_INTERSECT_SELECT::read_keys_and_merge");
+ result= read_keys_and_merge_scans(thd, head, quick_selects, pk_quick_select,
+ &read_record, TRUE);
+ doing_pk_scan= FALSE;
+ DBUG_RETURN(result);
+}
+
+int QUICK_INDEX_INTERSECT_SELECT::get_next()
+{
+ int result;
+ DBUG_ENTER("QUICK_INDEX_INTERSECT_SELECT::get_next");
+
+ if (doing_pk_scan)
+ DBUG_RETURN(pk_quick_select->get_next());
+
+ if ((result= read_record.read_record(&read_record)) == -1)
+ {
+ result= HA_ERR_END_OF_FILE;
+ end_read_record(&read_record);
+ free_io_cache(head);
+ /* All rows from Unique have been retrieved, do a clustered PK scan */
+ if (pk_quick_select)
+ {
+ doing_pk_scan= TRUE;
+ if ((result= pk_quick_select->init()) ||
+ (result= pk_quick_select->reset()))
+ DBUG_RETURN(result);
+ DBUG_RETURN(pk_quick_select->get_next());
+ }
+ }
+
+ DBUG_RETURN(result);
+}
+
/*
Retrieve next record.
@@ -9887,6 +10188,28 @@ void QUICK_INDEX_MERGE_SELECT::add_info_
str->append(')');
}
+void QUICK_INDEX_INTERSECT_SELECT::add_info_string(String *str)
+{
+ QUICK_RANGE_SELECT *quick;
+ bool first= TRUE;
+ List_iterator_fast<QUICK_RANGE_SELECT> it(quick_selects);
+ str->append(STRING_WITH_LEN("sort_intersect("));
+ while ((quick= it++))
+ {
+ if (!first)
+ str->append(',');
+ else
+ first= FALSE;
+ quick->add_info_string(str);
+ }
+ if (pk_quick_select)
+ {
+ str->append(',');
+ pk_quick_select->add_info_string(str);
+ }
+ str->append(')');
+}
+
void QUICK_ROR_INTERSECT_SELECT::add_info_string(String *str)
{
bool first= TRUE;
@@ -9911,6 +10234,7 @@ void QUICK_ROR_INTERSECT_SELECT::add_inf
str->append(')');
}
+
void QUICK_ROR_UNION_SELECT::add_info_string(String *str)
{
bool first= TRUE;
@@ -9940,8 +10264,12 @@ void QUICK_RANGE_SELECT::add_keys_and_le
used_lengths->append(buf, length);
}
-void QUICK_INDEX_MERGE_SELECT::add_keys_and_lengths(String *key_names,
- String *used_lengths)
+static
+void add_keys_and_lengths_of_index_scans(TABLE *head,
+ List<QUICK_RANGE_SELECT> quick_selects,
+ QUICK_RANGE_SELECT *pk_quick_select,
+ String *key_names,
+ String *used_lengths)
{
char buf[64];
uint length;
@@ -9975,6 +10303,20 @@ void QUICK_INDEX_MERGE_SELECT::add_keys_
}
}
+void QUICK_INDEX_MERGE_SELECT::add_keys_and_lengths(String *key_names,
+ String *used_lengths)
+{
+ add_keys_and_lengths_of_index_scans(head, quick_selects, pk_quick_select,
+ key_names, used_lengths);
+}
+
+void QUICK_INDEX_INTERSECT_SELECT::add_keys_and_lengths(String *key_names,
+ String *used_lengths)
+{
+ add_keys_and_lengths_of_index_scans(head, quick_selects, pk_quick_select,
+ key_names, used_lengths);
+}
+
void QUICK_ROR_INTERSECT_SELECT::add_keys_and_lengths(String *key_names,
String *used_lengths)
{
@@ -12310,6 +12652,22 @@ void QUICK_INDEX_MERGE_SELECT::dbug_dump
fprintf(DBUG_FILE, "%*s}\n", indent, "");
}
+void QUICK_INDEX_INTERSECT_SELECT::dbug_dump(int indent, bool verbose)
+{
+ List_iterator_fast<QUICK_RANGE_SELECT> it(quick_selects);
+ QUICK_RANGE_SELECT *quick;
+ fprintf(DBUG_FILE, "%*squick index_intersect select\n", indent, "");
+ fprintf(DBUG_FILE, "%*smerged scans {\n", indent, "");
+ while ((quick= it++))
+ quick->dbug_dump(indent+2, verbose);
+ if (pk_quick_select)
+ {
+ fprintf(DBUG_FILE, "%*sclustered PK quick:\n", indent, "");
+ pk_quick_select->dbug_dump(indent+2, verbose);
+ }
+ fprintf(DBUG_FILE, "%*s}\n", indent, "");
+}
+
void QUICK_ROR_INTERSECT_SELECT::dbug_dump(int indent, bool verbose)
{
List_iterator_fast<QUICK_RANGE_SELECT> it(quick_selects);
=== modified file 'sql/opt_range.h'
--- a/sql/opt_range.h 2009-09-02 08:40:18 +0000
+++ b/sql/opt_range.h 2010-06-29 00:02:19 +0000
@@ -195,12 +195,13 @@ public:
enum {
QS_TYPE_RANGE = 0,
- QS_TYPE_INDEX_MERGE = 1,
- QS_TYPE_RANGE_DESC = 2,
- QS_TYPE_FULLTEXT = 3,
- QS_TYPE_ROR_INTERSECT = 4,
- QS_TYPE_ROR_UNION = 5,
- QS_TYPE_GROUP_MIN_MAX = 6
+ QS_TYPE_INDEX_INTERSECT = 1,
+ QS_TYPE_INDEX_MERGE = 2,
+ QS_TYPE_RANGE_DESC = 3,
+ QS_TYPE_FULLTEXT = 4,
+ QS_TYPE_ROR_INTERSECT = 5,
+ QS_TYPE_ROR_UNION = 6,
+ QS_TYPE_GROUP_MIN_MAX = 7
};
/* Get type of this quick select - one of the QS_TYPE_* values */
@@ -312,8 +313,16 @@ protected:
friend QUICK_RANGE_SELECT *get_quick_select(PARAM*,uint idx,
SEL_ARG *key_tree,
MEM_ROOT *alloc);
+ friend
+ int read_keys_and_merge_scans(THD *thd, TABLE *head,
+ List<QUICK_RANGE_SELECT> quick_selects,
+ QUICK_RANGE_SELECT *pk_quick_select,
+ READ_RECORD *read_record,
+ bool intersection);
+
friend class QUICK_SELECT_DESC;
friend class QUICK_INDEX_MERGE_SELECT;
+ friend class QUICK_INDEX_INTERSECT_SELECT;
friend class QUICK_ROR_INTERSECT_SELECT;
friend class QUICK_GROUP_MIN_MAX_SELECT;
@@ -463,6 +472,44 @@ public:
READ_RECORD read_record;
};
+class QUICK_INDEX_INTERSECT_SELECT : public QUICK_SELECT_I
+{
+public:
+ QUICK_INDEX_INTERSECT_SELECT(THD *thd, TABLE *table);
+ ~QUICK_INDEX_INTERSECT_SELECT();
+
+ int init();
+ int reset(void);
+ int get_next();
+ bool reverse_sorted() { return false; }
+ bool unique_key_range() { return false; }
+ int get_type() { return QS_TYPE_INDEX_INTERSECT; }
+ void add_keys_and_lengths(String *key_names, String *used_lengths);
+ void add_info_string(String *str);
+ bool is_keys_used(const MY_BITMAP *fields);
+#ifndef DBUG_OFF
+ void dbug_dump(int indent, bool verbose);
+#endif
+
+ bool push_quick_back(QUICK_RANGE_SELECT *quick_sel_range);
+
+ /* range quick selects this index_merge read consists of */
+ List<QUICK_RANGE_SELECT> quick_selects;
+
+ /* quick select that uses clustered primary key (NULL if none) */
+ QUICK_RANGE_SELECT* pk_quick_select;
+
+ /* true if this select is currently doing a clustered PK scan */
+ bool doing_pk_scan;
+
+ MEM_ROOT alloc;
+ THD *thd;
+ int read_keys_and_merge();
+
+ /* used to get rows collected in Unique */
+ READ_RECORD read_record;
+};
+
/*
Rowid-Ordered Retrieval (ROR) index intersection quick select.
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2009-09-15 10:46:35 +0000
+++ b/sql/sql_class.h 2010-06-29 00:02:19 +0000
@@ -2827,6 +2827,7 @@ class user_var_entry
DTCollation collation;
};
+
/*
Unique -- class for unique (removing of duplicates).
Puts all values to the TREE. If the tree becomes too big,
@@ -2845,11 +2846,21 @@ class Unique :public Sql_alloc
uchar *record_pointers;
bool flush();
uint size;
+#if 0
+#else
+ uint full_size;
+ uint min_dupl_count;
+#endif
public:
ulong elements;
Unique(qsort_cmp2 comp_func, void *comp_func_fixed_arg,
+#if 0
uint size_arg, ulonglong max_in_memory_size_arg);
+#else
+ uint size_arg, ulonglong max_in_memory_size_arg,
+ uint min_dupl_count_arg= 0);
+#endif
~Unique();
ulong elements_in_tree() { return tree.elements_in_tree; }
inline bool unique_add(void *ptr)
@@ -2861,6 +2872,9 @@ public:
DBUG_RETURN(!tree_insert(&tree, ptr, 0, tree.custom_arg));
}
+ bool is_in_memory() { return (my_b_tell(&file) == 0); }
+ void close_for_expansion() { tree.flag= TREE_ONLY_DUPS; }
+
bool get(TABLE *table);
static double get_use_cost(uint *buffer, uint nkeys, uint key_size,
ulonglong max_in_memory_size);
@@ -2877,6 +2891,11 @@ public:
friend int unique_write_to_file(uchar* key, element_count count, Unique *unique);
friend int unique_write_to_ptrs(uchar* key, element_count count, Unique *unique);
+
+ friend int unique_write_to_file_with_count(uchar* key, element_count count,
+ Unique *unique);
+ friend int unique_intersect_write_to_ptrs(uchar* key, element_count count,
+ Unique *unique);
};
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2009-10-26 11:38:17 +0000
+++ b/sql/sql_select.cc 2010-06-29 00:02:19 +0000
@@ -13333,7 +13333,8 @@ test_if_skip_sort_order(JOIN_TAB *tab,OR
by clustered PK values.
*/
- if (quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_MERGE ||
+ if (quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_MERGE ||
+ quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_INTERSECT ||
quick_type == QUICK_SELECT_I::QS_TYPE_ROR_UNION ||
quick_type == QUICK_SELECT_I::QS_TYPE_ROR_INTERSECT)
DBUG_RETURN(0);
@@ -13682,6 +13683,7 @@ check_reverse_order:
QUICK_SELECT_DESC *tmp;
int quick_type= select->quick->get_type();
if (quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_MERGE ||
+ quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_INTERSECT ||
quick_type == QUICK_SELECT_I::QS_TYPE_ROR_INTERSECT ||
quick_type == QUICK_SELECT_I::QS_TYPE_ROR_UNION ||
quick_type == QUICK_SELECT_I::QS_TYPE_GROUP_MIN_MAX)
@@ -16405,6 +16407,7 @@ static void select_describe(JOIN *join,
quick_type= tab->select->quick->get_type();
if ((quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_MERGE) ||
(quick_type == QUICK_SELECT_I::QS_TYPE_ROR_INTERSECT) ||
+ (quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_INTERSECT) ||
(quick_type == QUICK_SELECT_I::QS_TYPE_ROR_UNION))
tab->type = JT_INDEX_MERGE;
else
@@ -16609,6 +16612,7 @@ static void select_describe(JOIN *join,
{
if (quick_type == QUICK_SELECT_I::QS_TYPE_ROR_UNION ||
quick_type == QUICK_SELECT_I::QS_TYPE_ROR_INTERSECT ||
+ quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_INTERSECT ||
quick_type == QUICK_SELECT_I::QS_TYPE_INDEX_MERGE)
{
extra.append(STRING_WITH_LEN("; Using "));
=== modified file 'sql/sql_sort.h'
--- a/sql/sql_sort.h 2007-09-27 14:05:07 +0000
+++ b/sql/sql_sort.h 2010-06-29 00:02:19 +0000
@@ -57,6 +57,7 @@ typedef struct st_sort_param {
uint addon_length; /* Length of added packed fields */
uint res_length; /* Length of records in final sorted file/buffer */
uint keys; /* Max keys / buffer */
+ element_count min_dupl_count;
ha_rows max_rows,examined_rows;
TABLE *sort_form; /* For quicker make_sortkey */
SORT_FIELD *local_sortorder;
@@ -80,4 +81,9 @@ int merge_buffers(SORTPARAM *param,IO_CA
IO_CACHE *to_file, uchar *sort_buffer,
BUFFPEK *lastbuff,BUFFPEK *Fb,
BUFFPEK *Tb,int flag);
+int merge_index(SORTPARAM *param, uchar *sort_buffer,
+ BUFFPEK *buffpek, uint maxbuffer,
+ IO_CACHE *tempfile, IO_CACHE *outfile);
+
void reuse_freed_buff(QUEUE *queue, BUFFPEK *reuse, uint key_length);
+
=== modified file 'sql/uniques.cc'
--- a/sql/uniques.cc 2009-09-07 20:50:10 +0000
+++ b/sql/uniques.cc 2010-06-29 00:02:19 +0000
@@ -33,7 +33,6 @@
#include "mysql_priv.h"
#include "sql_sort.h"
-
int unique_write_to_file(uchar* key, element_count count, Unique *unique)
{
/*
@@ -45,6 +44,12 @@ int unique_write_to_file(uchar* key, ele
return my_b_write(&unique->file, key, unique->size) ? 1 : 0;
}
+int unique_write_to_file_with_count(uchar* key, element_count count, Unique *unique)
+{
+ return my_b_write(&unique->file, key, unique->size) ||
+ my_b_write(&unique->file, &count, sizeof(element_count)) ? 1 : 0;
+}
+
int unique_write_to_ptrs(uchar* key, element_count count, Unique *unique)
{
memcpy(unique->record_pointers, key, unique->size);
@@ -52,10 +57,26 @@ int unique_write_to_ptrs(uchar* key, ele
return 0;
}
+int unique_intersect_write_to_ptrs(uchar* key, element_count count, Unique *unique)
+{
+ if (count >= unique->min_dupl_count)
+ {
+ memcpy(unique->record_pointers, key, unique->size);
+ unique->record_pointers+=unique->size;
+ }
+ return 0;
+}
+
+
Unique::Unique(qsort_cmp2 comp_func, void * comp_func_fixed_arg,
- uint size_arg, ulonglong max_in_memory_size_arg)
+ uint size_arg, ulonglong max_in_memory_size_arg,
+ uint min_dupl_count_arg)
:max_in_memory_size(max_in_memory_size_arg), size(size_arg), elements(0)
{
+ min_dupl_count= min_dupl_count_arg;
+ full_size= size;
+ if (min_dupl_count_arg)
+ full_size+= sizeof(element_count);
my_b_clear(&file);
init_tree(&tree, (ulong) (max_in_memory_size / 16), 0, size, comp_func, 0,
NULL, comp_func_fixed_arg);
@@ -276,7 +297,11 @@ double Unique::get_use_cost(uint *buffer
result= 2*log2_n_fact(last_tree_elems + 1.0);
if (n_full_trees)
result+= n_full_trees * log2_n_fact(max_elements_in_tree + 1.0);
+#if 1
result /= TIME_FOR_COMPARE_ROWID;
+#else
+ result /= TIME_FOR_COMPARE_ROWID * 10;
+#endif
DBUG_PRINT("info",("unique trees sizes: %u=%u*%lu + %lu", nkeys,
n_full_trees, n_full_trees?max_elements_in_tree:0,
@@ -327,7 +352,10 @@ bool Unique::flush()
file_ptr.count=tree.elements_in_tree;
file_ptr.file_pos=my_b_tell(&file);
- if (tree_walk(&tree, (tree_walk_action) unique_write_to_file,
+ tree_walk_action action= min_dupl_count ?
+ (tree_walk_action) unique_write_to_file_with_count :
+ (tree_walk_action) unique_write_to_file;
+ if (tree_walk(&tree, action,
(void*) this, left_root_right) ||
insert_dynamic(&file_ptrs, (uchar*) &file_ptr))
return 1;
@@ -357,6 +385,7 @@ Unique::reset()
reinit_io_cache(&file, WRITE_CACHE, 0L, 0, 1);
}
elements= 0;
+ tree.flag= 0;
}
/*
@@ -576,14 +605,16 @@ bool Unique::get(TABLE *table)
{
SORTPARAM sort_param;
table->sort.found_records=elements+tree.elements_in_tree;
-
if (my_b_tell(&file) == 0)
{
/* Whole tree is in memory; Don't use disk if you don't need to */
if ((record_pointers=table->sort.record_pointers= (uchar*)
my_malloc(size * tree.elements_in_tree, MYF(0))))
{
- (void) tree_walk(&tree, (tree_walk_action) unique_write_to_ptrs,
+ tree_walk_action action= min_dupl_count ?
+ (tree_walk_action) unique_intersect_write_to_ptrs :
+ (tree_walk_action) unique_write_to_ptrs;
+ (void) tree_walk(&tree, action,
this, left_root_right);
return 0;
}
@@ -614,7 +645,10 @@ bool Unique::get(TABLE *table)
sort_param.max_rows= elements;
sort_param.sort_form=table;
sort_param.rec_length= sort_param.sort_length= sort_param.ref_length=
- size;
+ sort_param.rec_length= sort_param.sort_length= sort_param.ref_length=
+ full_size;
+ sort_param.min_dupl_count= min_dupl_count;
+ sort_param.res_length= 0;
sort_param.keys= (uint) (max_in_memory_size / sort_param.sort_length);
sort_param.not_killable=1;
@@ -635,8 +669,9 @@ bool Unique::get(TABLE *table)
if (flush_io_cache(&file) ||
reinit_io_cache(&file,READ_CACHE,0L,0,0))
goto err;
- if (merge_buffers(&sort_param, &file, outfile, sort_buffer, file_ptr,
- file_ptr, file_ptr+maxbuffer,0))
+ sort_param.res_length= sort_param.rec_length-
+ (min_dupl_count ? sizeof(min_dupl_count) : 0);
+ if (merge_index(&sort_param, sort_buffer, file_ptr, maxbuffer, &file, outfile))
goto err;
error=0;
err:
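For orientation, a tiny worked example of the cost arithmetic used in
get_best_index_intersect() above, with invented numbers; the Unique and sweep
terms are plain placeholders here, since in the patch they come from
Unique::get_use_cost() and get_sweep_read_cost().
#include <cstdio>
int main()
{
  /* Hypothetical inputs: two merged scans over a 1,000,000-row table. */
  double table_records= 1e6;
  double records1= 5e4, records2= 2e4;        /* rows matched by each scan */
  double read_cost1= 120.0, read_cost2= 60.0; /* invented index-only read costs */
  double scan_records= records1 + records2;   /* rowids pushed through Unique */
  double selectivity= (records1 / table_records) * (records2 / table_records);
  double expected_rows= table_records * selectivity; /* 1e6 * 0.001 = 1000 */
  /* Placeholders for Unique::get_use_cost() and get_sweep_read_cost(). */
  double unique_cost= 300.0, sweep_cost= 250.0;
  double intersect_cost= read_cost1 + read_cost2 + unique_cost + sweep_cost;
  printf("scan_records=%.0f expected_rows=%.0f intersect_cost=%.1f\n",
         scan_records, expected_rows, intersect_cost);
  return 0;  /* the plan is kept only if intersect_cost < best read_time so far */
}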
1
0
[Maria-developers] update_virtual_fields() calls missing in sql_join_cache.cc?
by Sergey Petrunya 28 Jun '10
by Sergey Petrunya 28 Jun '10
28 Jun '10
Hello Igor,
It has come to my attention that sql_join_cache.cc does not have as many
update_virtual_fields() calls as I think it ought to have.
My reasoning was as follows: AFAIU when one has read a record from a table,
they must call update_virtual_fields() before they try to evaluate the
attached table condition (because the condition may refer to virtual fields).
Now if one opens sql_join_cache.cc and looks at these three functions:
JOIN_CACHE_BKA::join_matching_records(bool skip_last)
{
...
while (!(error= file->multi_range_read_next((char **) &rec_ptr)))
{
...
rc= generate_full_extensions(rec_ptr);
...
}
...
}
JOIN_CACHE::generate_full_extensions(uchar *rec_ptr)
{
...
if (check_match(rec_ptr))
{
...
}
JOIN_CACHE::check_match(uchar *rec_ptr)
{
/* Check whether pushdown conditions are satisfied */
if (join_tab->select && join_tab->select->skip_record())
return FALSE;
...
}
one can see a call path where we read a table record with multi_range_read_next
and then proceed to evaluating the attached condition with skip_record()
without calling update_virtual_fields(). Isn't this a problem?
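For illustration only, here is roughly the kind of call I would expect there
(a sketch, not a patch; I am assuming the update_virtual_fields(TABLE*) form
that rr_sequential() uses, the exact signature in the tree may differ):
JOIN_CACHE_BKA::join_matching_records(bool skip_last)
{
...
while (!(error= file->multi_range_read_next((char **) &rec_ptr)))
{
...
/* refresh the virtual columns of the just-read row before the pushed-down
condition is evaluated via check_match()/skip_record() */
update_virtual_fields(join_tab->table);
rc= generate_full_extensions(rec_ptr);
...
}
...
}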
Another thing I can't understand about update_virtual_fields() is the asymmetry
between rr_XXX() functions. Why do rr_quick() and rr_sequential() call
update_virtual_fields() while rr_index_first() and rr_from_pointers() don't? If
that is intentional, I think it deserves to be documented.
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
0
Hi.
A quick summary of recent changes in WorkLog:
* there are no "archived" tasks anymore. All cancelled and closed tasks
are "inactive", others are "active"
* guest users shouldn't be able to update WL tasks now.
* "private" field disappeared, you won't see "Private: yes" anymore.
* major rework of the reporting engine:
- multiple report generators (two, at the moment)
- multi-value select filters
- configurable result display
- saved reports
- more fields to display and to filter on
* sections in the task view are reordered
* second code review removed
* lead architect added
* WL emails set List-Archive and have the task number first in the
subject
* a "developer" can be anybody, not just an employee of MPAB
* virtual tasks removed from the right sidebar
- one can get similar results with saved reports
* no "Current as text" button
* no two tasks may have the same title anymore (this was a bug in the
old worklog that caused duplicate titles)
I may occasionally do more work on WL, after using it for a while and
understanding what's missing.
Regards,
Sergei
1
0
[Maria-developers] WL#69 Deleted (by Serg): Fix table_cache negative scalability
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Fix table_cache negative scalability
CREATION DATE..: Fri, 18 Dec 2009, 16:30
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 69 (http://askmonty.org/worklog/?tid=69)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Fix the problem described in this blog entry:
http://www.mysqlperformanceblog.com/2009/11/16/table_cache-negative-scalabi…
You can read the blog, or the text below.
--- quoted text ---
November 16, 2009
table_cache negative scalability
Posted by peter | Vote on Planet MySQL
Couple of months ago there was a post by FreshBooks on getting great performance
improvements by lowering table_cache variable. So I decided to investigate what
is really happening here.
The common sense approach to tuning caches is to get them as large as you can
if you have enough resources (such as memory). With MySQL, however, common sense
does not always work: we've seen performance issues with large query_cache_size,
and sort_buffer_size and read_buffer_size may not give you better performance if
you increase them. I found this also applies to some other buffers.
Even though I had previous experience of surprising behavior, I did not expect
such a table_cache issue: LRU cache management is a classic problem, and there
are scalable algorithms to deal with it. I would expect Monty to implement one
of them.
To do the test I created 100,000 empty tables containing a single integer
column and no indexes, and then ran SELECT * FROM tableN in a loop. Each table
in such a case is accessed only once, and on any but the first run each access
requires a table replacement in the table cache based on LRU logic.
MySQL Sandbox helped me to test this with different servers easily.
I did test on CentOS 5.3, Xeon E5405, 16GB RAM and EXT3 file system on the SATA
hard drive.
MySQL 5.0.85 created 100,000 tables in around 3 min 40 sec, which is about 450
tables/sec. This indicates that fsync is lying on this test system, as the
default sync_frm option is used.
With the default table_cache=64, accessing all tables takes 12 sec, which is
almost 8500 tables/sec, a great speed. We can note significant writes to the
disk during this read-only benchmark. Why? Because for MyISAM tables the table
header has to be modified each time the table is opened. In this case the
performance was so great because the data of all 100,000 tables (the first
block of the index) was placed close together on disk and was fully cached,
which made updates to the headers very fast. In production systems, with table
headers not in the OS cache, you will often see significantly lower numbers,
100 tables/sec or less.
With a significantly larger table_cache=16384 (and an appropriately adjusted
number of open files) the same operation takes 660 seconds, which is 151
tables/sec, around 50 times slower. Wow. This is the slowdown. We can see the
load becomes very CPU bound in this case, and it looks like some of the
table_cache algorithms do not scale well.
The absolute numbers are also very interesting: 151 tables/sec is not that bad
if you look at it as an absolute number. So if tuning the table cache the
normal way, by using a large table_cache, can bring your miss rate
(opened_tables) down to 10/sec or less, you should do so. However, if you have
so many tables that you still see 100+ misses/sec while your data (at least
the table headers) is well cached, so that the cost of a table cache miss is
not very high, you may be better off with a significantly reduced table cache
size.
The next step for me was to see if the problem was fixed in MySQL 5.1: in this
version the table cache was significantly redone and split into
table_open_cache and table_definition_cache, so I assumed the behavior might
be different as well.
MySQL 5.1.40
I started testing with the default table_open_cache=64 and
table_definition_cache=256: the read took about 12 seconds, very close to
MySQL 5.0.85.
As I increased table_definition_cache to 16384 the result remained the same,
so this variable is not causing the bottleneck. However, increasing
table_open_cache to 16384 causes the scan to take about 780 sec, which is a
bit worse than MySQL 5.0.85. So the problem is not fixed in MySQL 5.1; let's
see how MySQL 5.4 behaves.
MySQL 5.4.2
MySQL 5.4.2 has a higher default table_open_cache, so I took it down to 64 so
we can compare apples to apples. It performs the same as MySQL 5.0 and MySQL
5.1 with a small table cache.
With table_open_cache increased to 16384 the test took 750 seconds, so the
problem exists in MySQL 5.4 as well.
So the problem is real, and it is not fixed even in the performance-focused
MySQL 5.4. As we can see, large table_cache (or table_open_cache) values can
indeed cause significant performance problems. Interestingly enough, InnoDB
has a very similar task of managing its own cache of file descriptors (set by
innodb_open_files). As time allows I should test whether Heikki knows how to
implement LRU properly so that it does not have problems with large numbers.
We'll see.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#70 Deleted (by Serg): Fix table_cache negative scalability
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Fix table_cache negative scalability
CREATION DATE..: Fri, 18 Dec 2009, 16:30
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 70 (http://askmonty.org/worklog/?tid=70)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Fix the problem described in this blog entry:
http://www.mysqlperformanceblog.com/2009/11/16/table_cache-negative-scalabi…
You can read the blog, or the text below.
--- quoted text ---
November 16, 2009
table_cache negative scalability
Posted by peter | Vote on Planet MySQL
Couple of months ago there was a post by FreshBooks on getting great performance
improvements by lowering table_cache variable. So I decided to investigate what
is really happening here.
The common sense approach to tuning caches is to get them as large as you can
if you have enough resources (such as memory). With MySQL, however, common sense
does not always work: we've seen performance issues with large query_cache_size,
and sort_buffer_size and read_buffer_size may not give you better performance if
you increase them. I found this also applies to some other buffers.
Even though I had previous experience of surprising behavior, I did not expect
such a table_cache issue: LRU cache management is a classic problem, and there
are scalable algorithms to deal with it. I would expect Monty to implement one
of them.
To do the test I created 100,000 empty tables containing a single integer
column and no indexes, and then ran SELECT * FROM tableN in a loop. Each table
in such a case is accessed only once, and on any but the first run each access
requires a table replacement in the table cache based on LRU logic.
MySQL Sandbox helped me to test this with different servers easily.
I did test on CentOS 5.3, Xeon E5405, 16GB RAM and EXT3 file system on the SATA
hard drive.
MySQL 5.0.85 created 100,000 tables in around 3 min 40 sec, which is about 450
tables/sec. This indicates that fsync is lying on this test system, as the
default sync_frm option is used.
With the default table_cache=64, accessing all tables takes 12 sec, which is
almost 8500 tables/sec, a great speed. We can note significant writes to the
disk during this read-only benchmark. Why? Because for MyISAM tables the table
header has to be modified each time the table is opened. In this case the
performance was so great because the data of all 100,000 tables (the first
block of the index) was placed close together on disk and was fully cached,
which made updates to the headers very fast. In production systems, with table
headers not in the OS cache, you will often see significantly lower numbers,
100 tables/sec or less.
With a significantly larger table_cache=16384 (and an appropriately adjusted
number of open files) the same operation takes 660 seconds, which is 151
tables/sec, around 50 times slower. Wow. This is the slowdown. We can see the
load becomes very CPU bound in this case, and it looks like some of the
table_cache algorithms do not scale well.
The absolute numbers are also very interesting: 151 tables/sec is not that bad
if you look at it as an absolute number. So if tuning the table cache the
normal way, by using a large table_cache, can bring your miss rate
(opened_tables) down to 10/sec or less, you should do so. However, if you have
so many tables that you still see 100+ misses/sec while your data (at least
the table headers) is well cached, so that the cost of a table cache miss is
not very high, you may be better off with a significantly reduced table cache
size.
The next step for me was to see if the problem was fixed in MySQL 5.1: in this
version the table cache was significantly redone and split into
table_open_cache and table_definition_cache, so I assumed the behavior might
be different as well.
MySQL 5.1.40
I started testing with the default table_open_cache=64 and
table_definition_cache=256: the read took about 12 seconds, very close to
MySQL 5.0.85.
As I increased table_definition_cache to 16384 the result remained the same,
so this variable is not causing the bottleneck. However, increasing
table_open_cache to 16384 causes the scan to take about 780 sec, which is a
bit worse than MySQL 5.0.85. So the problem is not fixed in MySQL 5.1; let's
see how MySQL 5.4 behaves.
MySQL 5.4.2
MySQL 5.4.2 has a higher default table_open_cache, so I took it down to 64 so
we can compare apples to apples. It performs the same as MySQL 5.0 and MySQL
5.1 with a small table cache.
With table_open_cache increased to 16384 the test took 750 seconds, so the
problem exists in MySQL 5.4 as well.
So the problem is real, and it is not fixed even in the performance-focused
MySQL 5.4. As we can see, large table_cache (or table_open_cache) values can
indeed cause significant performance problems. Interestingly enough, InnoDB
has a very similar task of managing its own cache of file descriptors (set by
innodb_open_files). As time allows I should test whether Heikki knows how to
implement LRU properly so that it does not have problems with large numbers.
We'll see.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#72 Deleted (by Serg): Fix table_cache negative scalability
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Fix table_cache negative scalability
CREATION DATE..: Fri, 18 Dec 2009, 16:31
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 72 (http://askmonty.org/worklog/?tid=72)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Fix the problem described in this blog entry:
http://www.mysqlperformanceblog.com/2009/11/16/table_cache-negative-scalabi…
I attempted to paste the contents of the blog here, but worklog didn't accept
the text. You have to read the blog entry.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#73 Deleted (by Serg): Fix table_cache negative scalability
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Fix table_cache negative scalability
CREATION DATE..: Fri, 18 Dec 2009, 16:31
SUPERVISOR.....: Bothorsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 73 (http://askmonty.org/worklog/?tid=73)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Fix the problem described in this blog entry:
http://www.mysqlperformanceblog.com/2009/11/16/table_cache-negative-scalabi…
You can read the blog, or the text below.
--- quoted text ---
November 16, 2009
table_cache negative scalability
Posted by peter | Vote on Planet MySQL
Couple of months ago there was a post by FreshBooks on getting great performance
improvements by lowering table_cache variable. So I decided to investigate what
is really happening here.
The common sense approach to tuning caches is to get them as large as you can
if you have enough resources (such as memory). With MySQL, however, common sense
does not always work: we've seen performance issues with large query_cache_size,
and sort_buffer_size and read_buffer_size may not give you better performance if
you increase them. I found this also applies to some other buffers.
Even though I had previous experience of surprising behavior, I did not expect
such a table_cache issue: LRU cache management is a classic problem, and there
are scalable algorithms to deal with it. I would expect Monty to implement one
of them.
To do the test I created 100,000 empty tables containing a single integer
column and no indexes, and then ran SELECT * FROM tableN in a loop. Each table
in such a case is accessed only once, and on any but the first run each access
requires a table replacement in the table cache based on LRU logic.
MySQL Sandbox helped me to test this with different servers easily.
I did test on CentOS 5.3, Xeon E5405, 16GB RAM and EXT3 file system on the SATA
hard drive.
MySQL 5.0.85 created 100,000 tables in around 3 min 40 sec, which is about 450
tables/sec. This indicates that fsync is lying on this test system, as the
default sync_frm option is used.
With the default table_cache=64, accessing all tables takes 12 sec, which is
almost 8500 tables/sec, a great speed. We can note significant writes to the
disk during this read-only benchmark. Why? Because for MyISAM tables the table
header has to be modified each time the table is opened. In this case the
performance was so great because the data of all 100,000 tables (the first
block of the index) was placed close together on disk and was fully cached,
which made updates to the headers very fast. In production systems, with table
headers not in the OS cache, you will often see significantly lower numbers,
100 tables/sec or less.
With a significantly larger table_cache=16384 (and an appropriately adjusted
number of open files) the same operation takes 660 seconds, which is 151
tables/sec, around 50 times slower. Wow. This is the slowdown. We can see the
load becomes very CPU bound in this case, and it looks like some of the
table_cache algorithms do not scale well.
The absolute numbers are also very interesting: 151 tables/sec is not that bad
if you look at it as an absolute number. So if tuning the table cache the
normal way, by using a large table_cache, can bring your miss rate
(opened_tables) down to 10/sec or less, you should do so. However, if you have
so many tables that you still see 100+ misses/sec while your data (at least
the table headers) is well cached, so that the cost of a table cache miss is
not very high, you may be better off with a significantly reduced table cache
size.
The next step for me was to see if the problem was fixed in MySQL 5.1: in this
version the table cache was significantly redone and split into
table_open_cache and table_definition_cache, so I assumed the behavior might
be different as well.
MySQL 5.1.40
I started testing with the default table_open_cache=64 and
table_definition_cache=256: the read took about 12 seconds, very close to
MySQL 5.0.85.
As I increased table_definition_cache to 16384 the result remained the same,
so this variable is not causing the bottleneck. However, increasing
table_open_cache to 16384 causes the scan to take about 780 sec, which is a
bit worse than MySQL 5.0.85. So the problem is not fixed in MySQL 5.1; let's
see how MySQL 5.4 behaves.
MySQL 5.4.2
MySQL 5.4.2 has a higher default table_open_cache, so I took it down to 64 so
we can compare apples to apples. It performs the same as MySQL 5.0 and MySQL
5.1 with a small table cache.
With table_open_cache increased to 16384 the test took 750 seconds, so the
problem exists in MySQL 5.4 as well.
So the problem is real, and it is not fixed even in the performance-focused
MySQL 5.4. As we can see, large table_cache (or table_open_cache) values can
indeed cause significant performance problems. Interestingly enough, InnoDB
has a very similar task of managing its own cache of file descriptors (set by
innodb_open_files). As time allows I should test whether Heikki knows how to
implement LRU properly so that it does not have problems with large numbers.
We'll see.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#78 Deleted (by Serg): INSTALL PLUGIN *
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: INSTALL PLUGIN *
CREATION DATE..: Tue, 09 Feb 2010, 18:10
SUPERVISOR.....: Sergei
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 78 (http://askmonty.org/worklog/?tid=78)
VERSION........: WorkLog-3.4
STATUS.........: Cancelled
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 10 (hours remain)
ORIG. ESTIMATE.: 10
PROGRESS NOTES:
-=-=(Serg - Mon, 08 Mar 2010, 20:13)=-=-
High Level Description modified.
--- /tmp/wklog.78.old.32207 2010-03-08 20:13:06.000000000 +0000
+++ /tmp/wklog.78.new.32207 2010-03-08 20:13:06.000000000 +0000
@@ -6,3 +6,5 @@
INSTALL PLUGIN * SONAME xxx
would be a more convenient way to install everything at once.
+
+cancelled, as a duplicate of mwl:77
-=-=(Serg - Mon, 08 Mar 2010, 20:12)=-=-
Status updated.
--- /tmp/wklog.78.old.32035 2010-03-08 20:12:39.000000000 +0000
+++ /tmp/wklog.78.new.32035 2010-03-08 20:12:39.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Cancelled
DESCRIPTION:
InnoDB, XtraDB, PBXT (at least) come with a storage engine plugin and many
information_schema plugins in one .so file.
Currently one needs to install them all one by one.
INSTALL PLUGIN * SONAME xxx
would be a more convenient way to install everything at once.
cancelled, as a duplicate of mwl:77
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] WL#75 Deleted (by Serg): Extend build to create a shared libmysqld.so library
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Extend build to create a shared libmysqld.so library
CREATION DATE..: Fri, 22 Jan 2010, 09:39
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 75 (http://askmonty.org/worklog/?tid=75)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Currently, the embedded library libmysqld is only built as a static library,
libmysqld.a.
However, there is also a need for a shared embedded library, libmysqld.so.
A good example is Amarok, which is now using MySQL (and by default an embedded
libmysqld) for storing the user's music collection. Amarok is designed to load
a number of modules as .so plugins, and the code using libmysqld is one such
plugin. Any code that is to be loaded as a shared object on Linux must be
built with -fPIC (position independent code). Any library used must thus also
be built with -fPIC and preferably itself be linked statically.
Amarok is a widely used application (end-user desktop). It could provide good
leverage for making libmysqld more popular, e.g. in distros. However, currently
the distros need to resort to various hacks to make things work with libmysqld
and Amarok. Some links:
Fedora patches .spec to add -fPIC in the build, extract all objects from
libmysqld.a, and re-link them as libmysqld.so.
http://cvs.fedoraproject.org/viewvc/devel/mysql/mysql.spec?r1=1.108&r2=1.109
Gentoo apparently have a patch for MySQL to build libmysqld.so:
http://bugs.gentoo.org/attachment.cgi?id=188057
Debian seems to suggest building both libmysqld.a and libmysqld_pic.a, the
latter to be used for shared objects linking with embedded server:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508406#52
It would be nice if the main build system was able to build properly
libmysqld.so, so different distros would not need to resort to different hacks
to get things working.
The normal way to build both .a and .so, as far as I know, is to use libtool; it
will then build each object twice (with and without -fPIC) if needed. One might
want to make this optional to reduce build times. It would also need to be
done for each storage engine.
An alternative would be to build everything with -fPIC. This should probably
be optional also, as on some architectures (ELF x86 32-bit), there is some
speed penalty for -fPIC. The libmysqld.so would then be made only in this
case. Distro packages could do a separate build with -fPIC to make
libmysqld.so (to not get -fPIC into the main server code).
There is a MySQL bug for this:
http://bugs.mysql.com/bug.php?id=39288
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Unused
CREATION DATE..: Sun, 14 Feb 2010, 00:09
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Monty, Sergei
CATEGORY.......: Server-BackLog
TASK ID........: 84 (http://askmonty.org/worklog/?tid=84)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Igor - Tue, 16 Mar 2010, 19:28)=-=-
Title modified.
--- /tmp/wklog.84.old.22271 2010-03-16 19:28:50.000000000 +0000
+++ /tmp/wklog.84.new.22271 2010-03-16 19:28:50.000000000 +0000
@@ -1 +1 @@
-Partitioned Key Cache for MyISAM
+Unused
-=-=(Igor - Tue, 16 Mar 2010, 19:28)=-=-
Version updated.
--- /tmp/wklog.84.old.22271 2010-03-16 19:28:50.000000000 +0000
+++ /tmp/wklog.84.new.22271 2010-03-16 19:28:50.000000000 +0000
@@ -1 +1 @@
-Benchmarks-3.0
+Server-9.x
-=-=(Igor - Tue, 16 Mar 2010, 19:28)=-=-
High Level Description modified.
--- /tmp/wklog.84.old.22253 2010-03-16 19:28:09.000000000 +0000
+++ /tmp/wklog.84.new.22253 2010-03-16 19:28:09.000000000 +0000
@@ -1,18 +1 @@
-A partitioned key cache is a collection of structures for regular MyiSAM key
-caches called key cache partitions. Any page from a file can be placed into a
-buffer of only one partition. The number of the partition is calculated from the
-file number and the position of the page in the file, and it's always the same
-for the page. The function that maps pages into partitions takes care of even
-distribution of pages among partitions.
-Partition key cache mitigate one of the major problem of simple key cache:
-thread contention for key cache lock (mutex). Every call of a key cache
-interface function must acquire this lock. So threads compete for this lock even
-in the case when they have acquired shared locks for the file and pages they
-want read from are in the key cache buffers. When working with a partitioned key
-cache any key cache interface function that needs only one page has to acquire
-the key cache lock only for the partition the page is ascribed to. This makes
-the chances for threads not compete for the same key cache lock better.
-
-The idea and the original of the partitioned key cache was provided by one of
-our external contributers.
DESCRIPTION:
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Unused
CREATION DATE..: Sun, 14 Feb 2010, 00:17
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Igor, Monty, Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 86 (http://askmonty.org/worklog/?tid=86)
VERSION........: Server-5.2
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
-=-=(Igor - Tue, 16 Mar 2010, 19:30)=-=-
Title modified.
--- /tmp/wklog.86.old.22309 2010-03-16 19:30:04.000000000 +0000
+++ /tmp/wklog.86.new.22309 2010-03-16 19:30:04.000000000 +0000
@@ -1 +1 @@
-Partitioned Key Cache for MyISAM
+Unused
-=-=(Igor - Tue, 16 Mar 2010, 19:29)=-=-
High Level Description modified.
--- /tmp/wklog.86.old.22292 2010-03-16 19:29:37.000000000 +0000
+++ /tmp/wklog.86.new.22292 2010-03-16 19:29:37.000000000 +0000
@@ -1,19 +1 @@
-A partitioned key cache is a collection of structures for regular MyiSAM key
-caches called key cache partitions. Any page from a file can be placed into a
-buffer of only one partition. The number of the partition is calculated from the
-file number and the position of the page in the file, and it's always the same
-for the page. The function that maps pages into partitions takes care of even
-distribution of pages among partitions.
-Partition key cache mitigate one of the major problem of simple key cache:
-thread contention for key cache lock (mutex). Every call of a key cache
-interface function must acquire this lock. So threads compete for this lock even
-in the case when they have acquired shared locks for the file and pages they
-want read from are in the key cache buffers. When working with a partitioned key
-cache any key cache interface function that needs only one page has to acquire
-the key cache lock only for the partition the page is ascribed to. This makes
-the chances for threads not compete for the same key cache lock better.
-
-The idea and the original of the partitioned key cache was provided by one of
-our external contributers (see the attached file segmented_keycache_v2.diff with
-the original patch from the contributor).
-=-=(Igor - Sun, 14 Feb 2010, 00:19)=-=-
Privacy level updated.
--- /tmp/wklog.86.old.10092 2010-02-13 22:19:03.000000000 +0000
+++ /tmp/wklog.86.new.10092 2010-02-13 22:19:03.000000000 +0000
@@ -1 +1 @@
-y
+n
-=-=(Igor - Sun, 14 Feb 2010, 00:19)=-=-
Category updated.
--- /tmp/wklog.86.old.10092 2010-02-13 22:19:03.000000000 +0000
+++ /tmp/wklog.86.new.10092 2010-02-13 22:19:03.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Server-Sprint
-=-=(Igor - Sun, 14 Feb 2010, 00:18)=-=-
Version updated.
--- /tmp/wklog.86.old.10044 2010-02-14 00:18:31.000000000 +0200
+++ /tmp/wklog.86.new.10044 2010-02-14 00:18:31.000000000 +0200
@@ -1 +1 @@
-Benchmarks-3.0
+Server-5.2
DESCRIPTION:
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Unused
CREATION DATE..: Sun, 28 Feb 2010, 14:08
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-Sprint
TASK ID........: 93 (http://askmonty.org/worklog/?tid=93)
VERSION........: Server-5.3
STATUS.........: Cancelled
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Serg - Mon, 08 Mar 2010, 20:15)=-=-
Status updated.
--- /tmp/wklog.93.old.32332 2010-03-08 20:15:35.000000000 +0000
+++ /tmp/wklog.93.new.32332 2010-03-08 20:15:35.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Cancelled
-=-=(Psergey - Sun, 28 Feb 2010, 14:53)=-=-
High Level Description modified.
--- /tmp/wklog.93.old.22008 2010-02-28 14:53:14.000000000 +0000
+++ /tmp/wklog.93.new.22008 2010-02-28 14:53:14.000000000 +0000
@@ -1 +1 @@
-This is an umbrella task for all tasks in MariaDB 5.3
+
-=-=(Psergey - Sun, 28 Feb 2010, 14:53)=-=-
Title modified.
--- /tmp/wklog.93.old.21988 2010-02-28 14:53:03.000000000 +0000
+++ /tmp/wklog.93.new.21988 2010-02-28 14:53:03.000000000 +0000
@@ -1 +1 @@
-MariaDB 5.3
+Unused
DESCRIPTION:
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Unused
CREATION DATE..: Sun, 28 Feb 2010, 14:08
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-Sprint
TASK ID........: 94 (http://askmonty.org/worklog/?tid=94)
VERSION........: Server-5.3
STATUS.........: Cancelled
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Serg - Mon, 08 Mar 2010, 20:15)=-=-
Status updated.
--- /tmp/wklog.94.old.32348 2010-03-08 20:15:38.000000000 +0000
+++ /tmp/wklog.94.new.32348 2010-03-08 20:15:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Cancelled
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 95
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 68
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 90
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 91
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:53)=-=-
Title modified.
--- /tmp/wklog.94.old.22032 2010-02-28 14:53:45.000000000 +0000
+++ /tmp/wklog.94.new.22032 2010-02-28 14:53:45.000000000 +0000
@@ -1 +1 @@
-MariaDB 5.3
+Unused
-=-=(Psergey - Sun, 28 Feb 2010, 14:34)=-=-
Dependency created: 94 now depends on 95
-=-=(Psergey - Sun, 28 Feb 2010, 14:09)=-=-
Dependency created: 94 now depends on 67
-=-=(Psergey - Sun, 28 Feb 2010, 14:09)=-=-
Dependency created: 94 now depends on 91
------------------------------------------------------------
-=-=(View All Progress Notes, 12 total)=-=-
http://askmonty.org/worklog/index.pl?tid=94&nolimit=1
DESCRIPTION:
This is an umbrella task for all tasks in MariaDB 5.3
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Unused
CREATION DATE..: Sun, 28 Feb 2010, 14:34
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-Sprint
TASK ID........: 95 (http://askmonty.org/worklog/?tid=95)
VERSION........: Server-9.x
STATUS.........: Cancelled
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Serg - Mon, 08 Mar 2010, 20:15)=-=-
Status updated.
--- /tmp/wklog.95.old.32364 2010-03-08 20:15:43.000000000 +0000
+++ /tmp/wklog.95.new.32364 2010-03-08 20:15:43.000000000 +0000
@@ -1 +1 @@
-Assigned
+Cancelled
-=-=(Psergey - Sun, 28 Feb 2010, 14:55)=-=-
High Level Description modified.
--- /tmp/wklog.95.old.22166 2010-02-28 14:55:07.000000000 +0000
+++ /tmp/wklog.95.new.22166 2010-02-28 14:55:07.000000000 +0000
@@ -1 +1,2 @@
-We must fix known semi-join subquery bugs.
+Unused
+
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Title modified.
--- /tmp/wklog.95.old.22148 2010-02-28 14:54:56.000000000 +0000
+++ /tmp/wklog.95.new.22148 2010-02-28 14:54:56.000000000 +0000
@@ -1 +1 @@
-Subqueries backport: fix known semi-join subquery bugs
+Unused
-=-=(Psergey - Sun, 28 Feb 2010, 14:54)=-=-
Dependency deleted: 94 no longer depends on 95
-=-=(Psergey - Sun, 28 Feb 2010, 14:34)=-=-
Dependency created: 94 now depends on 95
DESCRIPTION:
Unused
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Unused
CREATION DATE..: Sun, 28 Feb 2010, 14:34
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Psergey, Timour
CATEGORY.......: Server-Sprint
TASK ID........: 96 (http://askmonty.org/worklog/?tid=96)
VERSION........: Server-9.x
STATUS.........: Cancelled
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Serg - Mon, 08 Mar 2010, 20:15)=-=-
Status updated.
--- /tmp/wklog.96.old.32380 2010-03-08 20:15:46.000000000 +0000
+++ /tmp/wklog.96.new.32380 2010-03-08 20:15:46.000000000 +0000
@@ -1 +1 @@
-Assigned
+Cancelled
-=-=(Psergey - Sun, 28 Feb 2010, 14:55)=-=-
High Level Description modified.
--- /tmp/wklog.96.old.22203 2010-02-28 14:55:49.000000000 +0000
+++ /tmp/wklog.96.new.22203 2010-02-28 14:55:49.000000000 +0000
@@ -1 +1 @@
-We must fix known semi-join subquery bugs.
+
-=-=(Psergey - Sun, 28 Feb 2010, 14:55)=-=-
Title modified.
--- /tmp/wklog.96.old.22185 2010-02-28 14:55:37.000000000 +0000
+++ /tmp/wklog.96.new.22185 2010-02-28 14:55:37.000000000 +0000
@@ -1 +1 @@
-Subqueries backport: fix known semi-join subquery bugs
+Unused
DESCRIPTION:
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)
1
0
[Maria-developers] Updated (by Guest): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Guest - Mon, 28 Jun 2010, 11:54)=-=-
Status updated.
--- /tmp/wklog.47.old.915 2010-06-28 11:54:12.000000000 +0000
+++ /tmp/wklog.47.new.915 2010-06-28 11:54:12.000000000 +0000
@@ -1 +1 @@
-Code-Review
+Complete
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 38 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. It has an empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of the Table map events which accompany the corresponding
rows events. (See the example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only together with the log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make the
slave reproduce these events whenever its binlog-annotate-rows-events global
value is ON? Because, for example, we may want to configure a slave that
should reproduce Annotate_rows events while keeping the global
binlog-annotate-rows-events = OFF as the default value for its client threads
(see also "How slave treats replicate-annotate-rows-events option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
  INCIDENT_EVENT= 26,
  // New MySQL event numbers are to be added here
  MYSQL_EVENTS_END,
  MARIA_EVENTS_BEGIN= 51,
  // New Maria event numbers start from here
  ANNOTATE_ROWS_EVENT= 51,
  ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for brevity, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member,
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table_map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table_map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The
slave (or mysqlbinlog in the remote case), before requesting a binlog dump
via the COM_BINLOG_DUMP command, informs the master whether it should send
these events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command rather than reusing COM_BINLOG_DUMP to avoid
possible conflicts with MySQL/Sun.
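For illustration, the slave IO thread (or mysqlbinlog reading from a remote
server) might issue this command just before the dump request, roughly as
sketched below. This is only a sketch: the helper name
request_dump_options_ext() and the use of the client-side simple_command()
wrapper are assumptions, not part of the worklog text;
BINLOG_SEND_ANNOTATE_ROWS_EVENT and opt_replicate_annotate_rows_events are
the names used elsewhere in this design.
  static int request_dump_options_ext(MYSQL *mysql)
  {
    uchar buf[1];
    /* bit 0 asks the master to send Annotate_rows events */
    buf[0]= opt_replicate_annotate_rows_events ?
            BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
    /* one-byte option packet, sent before COM_BINLOG_DUMP */
    return simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, buf, 1, 0);
  }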
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
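/* Forward the event unless it is an Annotate_rows event that this
   slave did not request via COM_BINLOG_DUMP_OPTIONS_EXT: */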
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep a pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
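A minimal sketch of how these members could behave (an assumed
implementation, not taken from the patch: it simply owns the pointer,
freeing any previously kept event):
  void Relay_log_info::set_annotate_event(Annotate_rows_log_event *event)
  {
    free_annotate_event();      // drop a previously kept event, if any
    m_annotate_event= event;
  }
  Annotate_rows_log_event* Relay_log_info::get_annotate_event()
  {
    return m_annotate_event;
  }
  void Relay_log_info::free_annotate_event()
  {
    delete m_annotate_event;    // deleting NULL is a no-op
    m_annotate_event= NULL;
  }
(m_annotate_event is assumed to be initialized to NULL in the
Relay_log_info constructor.)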
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
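/* Keep an Annotate_rows event alive until the last Rows event of its
   statement (STMT_END_F) has been applied, then free it: */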
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
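To illustrate, for the multi-table DELETE shown in the mysqlbinlog output
above, the slave SQL thread handles the group roughly like this
(illustrative trace, not program output):
  Annotate_rows (`DELETE t1, t2 FROM ...`) -> kept: rli->set_annotate_event()
  Table_map `test`.`t1` (id 16)            -> applied, deleted
  Table_map `test`.`t2` (id 17)            -> applied, deleted
  Delete_rows (table id 16)                -> applied, deleted
  Delete_rows (table id 17, STMT_END_F)    -> applied; free_annotate_event()
                                              deletes the kept Annotate_rows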
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be
coordinated somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Guest): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 28 Jun '10
by worklog-noreply@askmonty.org 28 Jun '10
28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Guest - Mon, 28 Jun 2010, 11:54)=-=-
Status updated.
--- /tmp/wklog.47.old.915 2010-06-28 11:54:12.000000000 +0000
+++ /tmp/wklog.47.new.915 2010-06-28 11:54:12.000000000 +0000
@@ -1 +1 @@
-Code-Review
+Complete
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 38 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
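A minimal sketch of what these members could do (assuming m_annotate_event
starts out as NULL; error handling omitted):
  void Relay_log_info::set_annotate_event(Annotate_rows_log_event *ev)
  {
    free_annotate_event();              /* drop a previous one, if any */
    m_annotate_event= ev;
  }
  Annotate_rows_log_event *Relay_log_info::get_annotate_event()
  {
    return m_annotate_event;
  }
  void Relay_log_info::free_annotate_event()
  {
    delete m_annotate_event;            /* delete of NULL is a no-op */
    m_annotate_event= NULL;
  }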
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumbers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
Reading events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
One example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
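As a purely illustrative use of the writer-callback form of materialise()
proposed below in the low-level design (the draft does not yet say how the
writer's void *context is supplied, so that part is an assumption):
  /* Hypothetical consumer-side writer streaming materialised bytes
     into an IO_CACHE. */
  static int write_to_cache(uchar *data, size_t len, void *context)
  {
    IO_CACHE *cache= (IO_CACHE *) context;
    return my_b_write(cache, data, len) ? -1 : 0;
  }
  /* Usage (context passing still to be defined in the API):
       ev->materialise(write_to_cache);
  */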
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
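To make the encapsulation point concrete, an accessor in this style would just
pull its answer out of the wrapped THD, for example (a sketch only, assuming
the statement base class keeps a THD pointer like the other base classes in
the draft below):
  LEX_STRING rpl_event_statement_base::get_current_db() const
  {
    LEX_STRING db;
    db.str= thd->db;                 /* current database of the connection */
    db.length= thd->db_length;
    return db;
  }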
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
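To show how a plugin would be expected to use these interfaces, here is a
hypothetical consumer that only cares about transaction commits and row
writes (a sketch against the draft classes above, not working code):
  class example_consumer : public rpl_event_consumer_row
  {
  public:
    virtual int trx_commit(const rpl_event_transaction_commit *ev)
    {
      const rpl_event_transaction_commit::global_transaction_id *id=
        ev->get_global_transaction_id();
      /* ... record (id->cluster_id, id->counter) somewhere ... */
      return 0;
    }
    virtual int row_write(const rpl_event_row_write *ev)
    {
      /* ... consume ev->get_write_set() and ev->get_after_image() ... */
      return 0;
    }
  };
  static example_consumer consumer;
  int register_example_consumer()
  {
    /* receives row events plus the statement and transaction events
       deferred down the stack */
    return rpl_event_row_register(&consumer);
  }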
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 28 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 28 Jun 2010, 07:11)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.18416 2010-06-28 07:11:09.000000000 +0000
+++ /tmp/wklog.120.new.18416 2010-06-28 07:11:09.000000000 +0000
@@ -14,10 +14,14 @@
An example of an event consumer is the writing of the binlog on a master.
-Event generators are not really plugins. Rather, there are specific points in
-the server where events are generated. However, a generator can be part of a
-plugin, for example a PBXT engine-level replication event generator would be
-part of the PBXT storage engine plugin.
+Some event generators are not really plugins. Rather, there are specific
+points in the server where events are generated. However, a generator can be
+part of a plugin, for example a PBXT engine-level replication event generator
+would be part of the PBXT storage engine plugin. And for example we could
+write a filter plugin, which would be stacked on top of an existing generator
+and provide the same event types and interfaces, but filtered in some way (for
+example by removing certain events on the master side, or by re-writing events
+in certain ways).
Event consumers on the other hand could be a plugin.
@@ -85,6 +89,9 @@
some is kept as reference to context (eg. THD) only. This however looses most
of the mentioned advantages for materialisation.
+Also, the non-materialising interface should be a good interface on top of
+which to build a materialising interface.
+
The design proposed here aims for as little materialisation as possible.
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to undertand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This idea is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+    stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
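As a rough sketch of how a consumer would use the registration interface above
(all names below are invented for the illustration and are not part of the
proposed API), a transaction-level consumer that records the global transaction
id of the last commit could look like this:

/* Sketch only: track the global transaction id of the last commit seen. */
class my_commit_tracker : public rpl_event_consumer_transaction
{
public:
  my_commit_tracker() : last_cluster_id(0), last_counter(0) { }

  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    const rpl_event_transaction_commit::global_transaction_id *gtid=
      ev->get_global_transaction_id();
    if (gtid)
    {
      last_cluster_id= gtid->cluster_id;
      last_counter= gtid->counter;
    }
    return 0;                                   /* 0 means success */
  }

private:
  uint32_t last_cluster_id;
  uint64_t last_counter;
};

static my_commit_tracker commit_tracker;

/* Called from some plugin init function (sketch). */
int init_commit_tracker()
{
  return rpl_event_transaction_register(&commit_tracker);
}

The same pattern applies one level up: a consumer deriving from
rpl_event_consumer_statement or rpl_event_consumer_row registers with the
corresponding function and additionally receives the events deferred from the
generators below.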
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+One example of an event generator is the execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+   will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would be separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
One example of an event generator is the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Some event generators are not really plugins. Rather, there are specific
points in the server where events are generated. However, a generator can be
part of a plugin, for example a PBXT engine-level replication event generator
would be part of the PBXT storage engine plugin. And for example we could
write a filter plugin, which would be stacked on top of an existing generator
and provide the same event types and interfaces, but filtered in some way (for
example by removing certain events on the master side, or by re-writing events
in certain ways).
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
Also, the non-materialising interface should be a good interface on top of
which to build a materialising interface.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
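A tiny self-contained sketch of the two styles discussed above (every name
below is a stand-in invented for the example, not the real THD or event
classes):

/* Encapsulated style: consumers only see the event's accessor, which the
   compiler can still inline. */
struct thd_standin { unsigned long sql_mode; };   /* stand-in, not the real THD */

class stmt_event_encapsulated
{
public:
  stmt_event_encapsulated(thd_standin *t) : thd(t) { }
  unsigned long get_sql_mode() const { return thd->sql_mode; }
private:
  thd_standin *thd;
};

/* (type, THD) style: the consumer reads the context directly and is thereby
   tied to its layout. */
struct stmt_event_raw
{
  int type;
  thd_standin *thd;
};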
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
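A sketch of what a writer for the materialise() variant above could look like
(assuming the materialise() members end up public, since the draft lists them
before any access specifier; the stdio usage and all names here are
illustrative only):

#include <stdio.h>

/*
  Sketch only: append materialised event bytes to an already-open file.
  The draft does not yet show how the caller supplies the writer's context
  pointer; here it is simply assumed to arrive as a FILE *.
*/
static int write_event_bytes(uchar *data, size_t len, void *context)
{
  FILE *out= (FILE *) context;
  return (fwrite(data, 1, len, out) == len) ? 0 : -1;
}

/* A consumer wanting to archive an event would then call: */
static int archive_event(const rpl_event_base *ev)
{
  return ev->materialise(write_event_bytes);
}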
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
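As a rough illustration of how these definitions would be used (sketch only:
the counter class and its registration helper are invented for the example,
and note that the query accessors above are not yet declared const, so they
cannot be called through the const pointer the callback receives as written):

/* Sketch only: count statements and statement errors seen by the generator. */
class query_counter : public rpl_event_consumer_statement
{
public:
  query_counter() : queries(0), failed(0) { }

  virtual int stmt_query(const rpl_event_statement_query *)
  {
    /* get_query_string() etc. are not yet const in the draft, so only the
       fact that a query event arrived is counted here. */
    queries++;
    return 0;
  }

  virtual int stmt_end(const rpl_event_statement_end *ev)
  {
    if (ev->get_errorcode() != 0)
      failed++;
    return 0;
  }

private:
  uint64_t queries;
  uint64_t failed;
};

static query_counter counter;

int register_query_counter()
{
  return rpl_event_statement_register(&counter);
}

Because rpl_event_consumer_statement inherits the transaction callbacks, the
same object could also override trx_commit() if it wanted per-transaction
counts.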
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
Hi everyone,
After a long and intense fight with CPack and NSIS, I finally have a
solution that is functional. The one TODO I have before I consider it
really good enough is to be able to set up MariaDB as a service. That
will come later.
The big problem with the installer was how to handle the database files.
If they are just copied to the data dir and used, the uninstaller will
silently delete them. This is *bad*. So I spent a long time trying to
get around this problem and make the uninstaller ask if the user wants
to get rid of these files. I'm now completely convinced this is
impossible with the current CPack :(
I have tried several workarounds that also wouldn't work, before I came
up with this:
The installer will install the data files to data\clean. At the end of
the installer, it checks if there is a file called data\mysql\db.frm
(could have been any other file). If the file is there, the user gets a
message saying the installer has not written the clean database files
to the data directory. If the file isn't there, the installer copies all
the files in data\clean to data.
The uninstaller will of course silently delete all the files in
data\clean. But it will give the user a message that the database files
are not deleted.
So, if you install this package and uninstall it again, the database
files are still on the disk. If you reinstall the package, it will use
the existing data files.
If you upgrade to a newer version, this will be installed in a different
directory (the default directory name contains the version number), and you
can copy the data files from the old directory into it if you want to.
Or you can copy the clean dir somewhere else and modify the ini file to
point at it.
IMHO, this is a reasonable solution that doesn't involve patching CMake
or some other evil scheme I've been considering.
To generate an installer: Run cmake as usual, build in Visual Studio,
and call "cpack" when the build is done. That's about as simple as
possible :)
Can I check this into the 5.2 branch?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
1
0
Hello 5.3 developers,
We all know that the 5.3 tree has some buildbot failures that
- are unlikely to be the result of any 5.3 work,
- cannot be observed in 5.2
- still are somehow present.
I got suspicious about one failure, and investigated it:
https://bugs.launchpad.net/maria/+bug/597742. Long story short, it was
present in 5.2 at some earlier point but has been fixed there since then.
I think, in order to avoid spending time the way it was spent on analyzing the
above-mentioned bug, we should do a 5.2->5.3 merge. 5.2 now produces an almost
green run in buildbot (the exception is plugin_load.test), and AFAIU the
release of 5.2.1 can be interpreted as an indication that 5.2's code is not going
to change much anymore.
Any objections to doing the merge?
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
0
Re: [Maria-developers] [Commits] Rev 2817: Make MariaDB compile with VS 2010 in file:///Users/hakan/work/monty_program/maria-5.2/
by Kristian Nielsen 24 Jun '10
by Kristian Nielsen 24 Jun '10
24 Jun '10
Hakan Kuecuekyilmaz <hakan(a)askmonty.org> writes:
> === modified file 'sql/CMakeLists.txt'
> --- a/sql/CMakeLists.txt 2010-06-01 19:52:20 +0000
> +++ b/sql/CMakeLists.txt 2010-06-24 10:44:39 +0000
> @@ -17,8 +17,7 @@
> SET(CMAKE_CXX_FLAGS_DEBUG
> "${CMAKE_CXX_FLAGS_DEBUG} -DSAFEMALLOC -DSAFE_MUTEX -DUSE_SYMDIR /Zi")
> SET(CMAKE_C_FLAGS_DEBUG
> - "${CMAKE_C_FLAGS_DEBUG} -DSAFEMALLOC -DSAFE_MUTEX -DUSE_SYMDIR /Zi")
> -SET(CMAKE_EXE_LINKER_FLAGS_DEBUG "${CMAKE_EXE_LINKER_FLAGS_DEBUG} /MAP /MAPINFO:EXPORTS")
> + "${CMAKE_C_FLAGS_DEBUG} -DSAFEMALLOC -DSAFE_MUTEX -DUSE_SYMDIR /Zi")
Avoid making spurious whitespace-only changes like this (added space at end of line).
> === added file 'win/build-vs10.bat'
> --- a/win/build-vs10.bat 1970-01-01 00:00:00 +0000
> +++ b/win/build-vs10.bat 2010-06-24 10:44:39 +0000
> @@ -0,0 +1,18 @@
> +@echo off
> +
> +REM Copyright (C) 2010 Monty Program AB
> +REM
> +REM This program is free software; you can redistribute it and/or modify
> +REM it under the terms of the GNU General Public License as published by
> +REM the Free Software Foundation; version 2 of the License.
> +REM
> +REM This program is distributed in the hope that it will be useful,
> +REM but WITHOUT ANY WARRANTY; without even the implied warranty of
> +REM MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +REM GNU General Public License for more details.
> +REM
> +REM You should have received a copy of the GNU General Public License
> +REM along with this program; if not, write to the Free Software
> +REM Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +cmake -G "Visual Studio 10"
> +
>
> === added file 'win/build-vs10_x64.bat'
> --- a/win/build-vs10_x64.bat 1970-01-01 00:00:00 +0000
> +++ b/win/build-vs10_x64.bat 2010-06-24 10:44:39 +0000
> @@ -0,0 +1,18 @@
> +@echo off
> +
> +REM Copyright (C) 2010 Monty Program AB
> +REM
> +REM This program is free software; you can redistribute it and/or modify
> +REM it under the terms of the GNU General Public License as published by
> +REM the Free Software Foundation; version 2 of the License.
> +REM
> +REM This program is distributed in the hope that it will be useful,
> +REM but WITHOUT ANY WARRANTY; without even the implied warranty of
> +REM MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +REM GNU General Public License for more details.
> +REM
> +REM You should have received a copy of the GNU General Public License
> +REM along with this program; if not, write to the Free Software
> +REM Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +cmake -G "Visual Studio 10 Win64"
> +
You need to add these new files to EXTRA_DIST in Makefile.am.
> === modified file 'win/configure-mariadb.sh'
> --- a/win/configure-mariadb.sh 2009-10-08 19:04:12 +0000
> +++ b/win/configure-mariadb.sh 2010-06-24 10:44:39 +0000
> @@ -15,9 +15,7 @@
> WITH_FEDERATED_STORAGE_ENGINE \
> WITH_MERGE_STORAGE_ENGINE \
> WITH_PARTITION_STORAGE_ENGINE \
> - WITH_MARIA_STORAGE_ENGINE \
> - WITH_PBXT_STORAGE_ENGINE \
> - WITH_XTRADB_STORAGE_ENGINE \
> + WITH_MARIA_STORAGE_ENGINE \
> + WITH_PBXT_STORAGE_ENGINE \
> + WITH_XTRADB_STORAGE_ENGINE \
> WITH_EMBEDDED_SERVER
> -
> -
Why?
- Kristian.
2
1
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implemented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intention is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practice without very heavy
+modification/constification of existing code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+    As for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+    binlog/transport will in any case need to understand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+    Get the local transaction id. This id is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+    Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+    stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+One example of an event generator is the execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+   will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and reading of events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
Examples of event generators are the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
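For illustration, a minimal self-contained C++ sketch of this stacking idea (the
class names here are invented simplifications, not part of the proposed API):
the upper, row-level generator handles what it can and defers the rest to the
statement-level generator below it.
#include <cstdio>
/* Hypothetical, simplified classes used only to illustrate deferral
   between stacked generators. */
struct statement_event { const char *query; };
class statement_generator
{
public:
  /* The lower generator can always emit a statement-level event. */
  void emit_statement(const statement_event &ev)
  {
    std::printf("statement-level event: %s\n", ev.query);
  }
};
class row_generator
{
public:
  explicit row_generator(statement_generator *below_arg) : below(below_arg) { }
  /* Handle what we can (row events); defer what we cannot (eg. DDL). */
  void on_statement(const statement_event &ev, bool is_ddl)
  {
    if (is_ddl)
      below->emit_statement(ev);              /* deferred downwards */
    else
      std::printf("row-level events for: %s\n", ev.query);
  }
private:
  statement_generator *below;
};
int main()
{
  statement_generator stmt_gen;
  row_generator row_gen(&stmt_gen);
  statement_event update = { "UPDATE t1 SET a = a + 1" };
  statement_event ddl    = { "ALTER TABLE t1 ADD b INT" };
  row_gen.on_statement(update, false);        /* handled as row events */
  row_gen.on_statement(ddl, true);            /* deferred to statement level */
  return 0;
}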
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
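As a rough illustration of the difference (using an invented stand-in context
struct instead of the real THD), a materialised event copies its data out at
generation time, while a non-materialised event only wraps the context and
reads it through accessors when a consumer asks:
#include <stdint.h>
#include <string>
/* Stand-in for the server-side execution context; purely illustrative. */
struct fake_context
{
  std::string query;
  uint64_t    trx_id;
};
/* Materialised: everything copied into the event when it is generated. */
struct materialised_query_event
{
  std::string query;
  uint64_t    trx_id;
};
/* Non-materialised: just a reference to the context plus accessors;
   nothing is copied unless a consumer actually asks for it. */
class query_event
{
public:
  explicit query_event(const fake_context *ctx_arg) : ctx(ctx_arg) { }
  const std::string &get_query_string() const { return ctx->query; }
  uint64_t get_local_trx_id() const { return ctx->trx_id; }
private:
  const fake_context *ctx;
};
int main()
{
  fake_context ctx = { "INSERT INTO t1 VALUES (1)", 42 };
  query_event ev(&ctx);
  return (ev.get_local_trx_id() == 42 && !ev.get_query_string().empty()) ? 0 : 1;
}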
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
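A very rough sketch of what such a stream-level facility might look like: a
consumer-side object (with a deliberately simplified interface, not the one
proposed below) that frames each event it receives into a length-prefixed byte
stream suitable for a binlog-like transport. The framing format here is
invented purely for illustration.
#include <stdint.h>
#include <string>
#include <vector>
/* Simplified stand-in for an event handed to the facility; the real
   thing would sit on top of the consumer interfaces proposed below. */
struct simple_event
{
  uint8_t     type;
  std::string payload;
};
class stream_materialiser
{
public:
  /* Append one event as a frame: 1-byte type, 4-byte length, payload.
     (Host endian only; a real format would fix the byte order.) */
  void consume(const simple_event &ev)
  {
    uint32_t len = static_cast<uint32_t>(ev.payload.size());
    const uint8_t *len_bytes = reinterpret_cast<const uint8_t *>(&len);
    stream.push_back(ev.type);
    stream.insert(stream.end(), len_bytes, len_bytes + sizeof(len));
    stream.insert(stream.end(), ev.payload.begin(), ev.payload.end());
  }
  const std::vector<uint8_t> &data() const { return stream; }
private:
  std::vector<uint8_t> stream;
};
int main()
{
  stream_materialiser m;
  simple_event begin_ev  = { 1, "BEGIN" };
  simple_event insert_ev = { 2, "INSERT INTO t1 VALUES (1)" };
  simple_event commit_ev = { 3, "COMMIT" };
  m.consume(begin_ev);
  m.consume(insert_ev);
  m.consume(commit_ev);
  return m.data().empty() ? 1 : 0;
}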
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
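To make the intended calling convention concrete, here is a small
self-contained sketch (stand-in types, not the proposed classes) of a generator
passing the same const event to every registered consumer and propagating the
first non-zero return code, which is one possible error semantic:
#include <stddef.h>
#include <vector>
/* Stand-in event and consumer types, for illustration only. */
struct demo_event { int value; };
struct demo_consumer
{
  virtual ~demo_consumer() { }
  /* Consumers receive the event as const and return 0 for success. */
  virtual int on_event(const demo_event *ev) { (void)ev; return 0; }
};
/* The same const event instance is handed to every registered consumer. */
static int fire_event(const std::vector<demo_consumer *> &consumers,
                      const demo_event *ev)
{
  for (size_t i = 0; i < consumers.size(); i++)
  {
    int err = consumers[i]->on_event(ev);
    if (err)
      return err;                 /* one possible semantic: stop on error */
  }
  return 0;
}
int main()
{
  demo_consumer a, b;
  std::vector<demo_consumer *> consumers;
  consumers.push_back(&a);
  consumers.push_back(&b);
  demo_event ev = { 42 };
  return fire_event(consumers, &ev);
}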
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
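/*
  For illustration only (not part of the draft): usage of the writer-based
  materialise() variant could look roughly like the self-contained sketch
  below. The demo_event class is a stand-in for a real derived event class,
  and the extra context argument passed through to the writer is an
  assumption, since the draft signature above does not yet take one even
  though the writer callback already expects a context pointer.
*/
#include <stddef.h>
#include <string>
typedef unsigned char uchar;
/* Stand-in for a derived event class; a real one would materialise
   from THD context inside do_materialise(). */
class demo_event
{
public:
  explicit demo_event(const char *q) : query(q) { }
  /* Same shape as the draft materialise(), plus an explicit context
     argument (an assumption, see the note above). */
  int materialise(int (*writer)(uchar *data, size_t len, void *context),
                  void *context) const
  {
    /* Hand the whole payload to the writer in one chunk. */
    return writer(reinterpret_cast<uchar *>(const_cast<char *>(query.c_str())),
                  query.size(), context);
  }
private:
  std::string query;
};
/* A writer that appends each chunk to a std::string passed as context;
   it could just as well write to a file or a socket. */
static int append_to_string(uchar *data, size_t len, void *context)
{
  static_cast<std::string *>(context)->append(reinterpret_cast<char *>(data),
                                              len);
  return 0;                       /* 0 on success, as in the draft */
}
int main()
{
  demo_event ev("CREATE TABLE t1 (a INT)");
  std::string buf;
  return ev.materialise(append_to_string, &buf) != 0 || buf.empty();
}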
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
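/*
  For illustration only (not part of the draft): since the format is fixed as
  (cluster_id, running_counter), a consumer could order commit events and
  print a start position for a slave as in the small standalone sketch below.
  The struct is re-declared locally so the sketch compiles on its own; the
  exact comparison semantics (gaps and so on) are up to the TC that allocates
  the IDs.
*/
#include <stdint.h>
#include <stdio.h>
/* Mirrors the (cluster_id, counter) layout described above. */
struct global_transaction_id
{
  uint32_t cluster_id;
  uint64_t counter;
};
/* Order first by cluster, then by the running counter. */
static bool gtid_less(const global_transaction_id &a,
                      const global_transaction_id &b)
{
  if (a.cluster_id != b.cluster_id)
    return a.cluster_id < b.cluster_id;
  return a.counter < b.counter;
}
int main()
{
  global_transaction_id a = { 1, 100 };
  global_transaction_id b = { 1, 101 };
  /* A slave could use a textual "cluster-counter" form as its start position. */
  printf("start after %u-%llu\n",
         (unsigned) a.cluster_id, (unsigned long long) a.counter);
  return gtid_less(a, b) ? 0 : 1;
}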
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
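/*
  A usage sketch of the draft above, for illustration only: a consumer derives
  from rpl_event_consumer_row, overrides only the callbacks it cares about, and
  registers itself. The skeleton declarations at the top are pared-down copies
  of the draft so the sketch stands alone; none of this exists in the server
  yet, and the registration function body is a stand-in.
*/
class rpl_event_statement_query { };
class rpl_event_row_write { };
struct rpl_event_consumer_transaction
{
  virtual ~rpl_event_consumer_transaction() { }
};
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
  virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
};
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
  virtual int row_write(const rpl_event_row_write *) { return 0; }
};
/* Stand-in for the registration function declared in the draft. */
static const rpl_event_consumer_row *registered_consumer = 0;
int rpl_event_row_register(const rpl_event_consumer_row *cbs)
{
  registered_consumer = cbs;
  return 0;
}
/* A binlog-like consumer: override only the callbacks of interest, keep
   the no-op defaults for everything else. */
struct my_binlog_consumer : public rpl_event_consumer_row
{
  virtual int stmt_query(const rpl_event_statement_query *)
  { return 0; /* would write a query event here */ }
  virtual int row_write(const rpl_event_row_write *)
  { return 0; /* would write a row event here */ }
};
int main()
{
  static my_binlog_consumer consumer;
  return rpl_event_row_register(&consumer);
}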
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440 2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440 2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implemented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intention is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practice without very heavy
+modification/constification of existing code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+ binlog/transport will in any case need to understand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+ Get the local transaction id. This id is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+ consume_stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+Examples of event generators is execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and reading of events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
Examples of event generators are the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows to retrieve all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and an error
code otherwise. I still need to think more about whether this is useful
(ie. what the semantics of returning an error from a consumer callback
should be).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
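As an aside, the fixed-buffer materialise() overload could be implemented
once in the base class on top of do_materialise(). The sketch below is
illustrative only: it assumes a do_materialise() variant that is also handed
the writer's opaque context pointer (the draft signature above does not pass
one), the names buffer_writer_state and buffer_writer are invented, and
<string.h> is needed for memcpy().

struct buffer_writer_state
{
  uchar *buf;
  size_t buflen;
  size_t used;
  bool overflow;
};

static int buffer_writer(uchar *data, size_t len, void *context)
{
  buffer_writer_state *st= (buffer_writer_state *)context;
  if (st->used + len > st->buflen)
  {
    st->overflow= true;
    return 1;                                /* abort materialisation */
  }
  memcpy(st->buf + st->used, data, len);
  st->used+= len;
  return 0;
}

int rpl_event_base::materialise(uchar *buffer, size_t buflen) const
{
  buffer_writer_state st= { buffer, buflen, 0, false };
  /* Assumes the do_materialise(writer, context) variant; see note above. */
  int res= do_materialise(buffer_writer, &st);
  if (st.overflow)
    return -2;
  return res ? -1 : 0;
}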
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the TC manager / primary redundancy
service has decided to commit the transaction. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)(1 << STMT_FOREIGN_KEY_CHECKS)) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
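A usage sketch (not part of the draft): a consumer-side helper that captures
the query text together with some of the context needed to re-execute it on
a slave. It assumes the flag_bits values are bit positions, as in the usage
comment above; captured_query and capture_query are invented names.

struct captured_query
{
  LEX_STRING query;
  ulong sql_mode;
  const CHARSET_INFO *cs_client;
  bool foreign_key_checks;
};

static void capture_query(rpl_event_statement_query *ev, captured_query *out)
{
  out->query= ev->get_query_string();
  out->sql_mode= ev->get_sql_mode();
  out->cs_client= ev->get_character_set_client();
  out->foreign_key_checks=
    (ev->get_flags() &
     (uint32_t)(1 << rpl_event_statement_query::STMT_FOREIGN_KEY_CHECKS)) != 0;
}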
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
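Another usage sketch (not part of the draft): a consumer could simply append
each block to a temporary file, to be replayed on the slave when the LOAD
DATA statement itself arrives. save_load_data_block is an invented name and
<stdio.h> is assumed.

static int save_load_data_block(const rpl_event_statement_load_data_block *ev,
                                FILE *tmp_file)
{
  rpl_event_statement_load_data_block::block b= ev->get_block();
  if (fwrite(b.ptr, 1, b.size, tmp_file) != b.size)
    return -1;                               /* write failed (disk full?) */
  return 0;
}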
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)(1 << ROW_WRITE_CAN_REPLACE)) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
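One more usage sketch (not part of the draft): walking the tables modified
by a row event and testing one of the handler-related flags. Decoding the
before/after images against the TABLE definition is not specified in this
draft, so it is left out; inspect_row_event is an invented name.

static void inspect_row_event(rpl_event_row_base *ev)
{
  bool fk_checks_disabled=
    (ev->get_flags() &
     (uint32_t)(1 << rpl_event_row_base::ROW_DISABLE_FOREIGN_KEY_CHECKS)) != 0;
  (void)fk_checks_disabled;      /* eg. toggle @@foreign_key_checks on apply */

  rpl_event_row_base::table_iterator it= ev->get_modified_tables();
  while (const TABLE *table= it.get_next())
  {
    /* eg. record the table's database and name for the consumer */
    (void)table;
  }
}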
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
the consumer overrides take effect; the default implementations do nothing.
If an event applies to multiple callbacks in a single callback struct, it
will only be passed to the most specific overridden callback (so events
never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if overridden they take precedence
over stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
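To show how the pieces fit together, here is an end-to-end usage sketch
(illustrative only; example_row_consumer and example_plugin_init are
invented names). A consumer derives from the interface matching the most
specific generator it cares about, overrides the callbacks it needs, and
registers itself with the corresponding registration function declared
above.

struct example_row_consumer : public rpl_event_consumer_row
{
  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    /* eg. remember the global transaction id as the replication position. */
    const rpl_event_transaction_commit::global_transaction_id *gtid=
      ev->get_global_transaction_id();
    (void)gtid;
    return 0;
  }
  virtual int row_write(const rpl_event_row_write *ev)
  {
    /* eg. materialise the row change and ship it to a slave. */
    (void)ev;
    return 0;
  }
  /* All other callbacks keep their default no-op implementations. */
};

static example_row_consumer example_consumer;

/* Called from plugin initialisation: */
static int example_plugin_init()
{
  return rpl_event_row_register(&example_consumer);
}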
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.
--- /tmp/wklog.120.old.7355 2010-06-24 14:29:11.000000000 +0000
+++ /tmp/wklog.120.new.7355 2010-06-24 14:29:11.000000000 +0000
@@ -1 +1,385 @@
+A consumer is implented as a virtual class (interface). There is one virtual
+function for every event that can be received. A consumer would derive from
+the base class and override methods for the events it wants to receive.
+
+There is one consumer interface for each generator. When a generator A is
+stacked on B, the consumer interface for A inherits from the interface for
+B. This way, when A defers an event to B, the consumer for A will receive the
+corresponding event from B.
+
+There are methods for a consumer to register itself to receive events from
+each generator. I still need to find a way for a consumer in one plugin to
+register itself with a generator implemented in another plugin (eg. PBXT
+engine-level replication). I also need to add a way for consumers to
+de-register themselves.
+
+The current design has consumer callbacks return 0 for success and error code
+otherwise. I still need to think more about whether this is useful (ie. what
+is the semantics of returning an error from a consumer callback).
+
+Each event passed to consumers is defined as a class with public accessor
+methods to a private context (which is mostly the THD).
+
+My intension is to make all events passed around const, so that the same event
+can be passed to each of multiple registered consumers (and to emphasise that
+consumers do not have the ability to modify events). It still needs to be seen
+whether that const-ness will be feasible in practise without very heavy
+modification/constification of exiting code.
+
+What follows is a partial draft of a possible definition of the API as
+concrete C++ class definitions.
+
+-----------------------------------------------------------------------
+
+/*
+ Virtual base class for generated replication events.
+
+ This is the parent of events generated from all kinds of generators. Only
+ child classes can be instantiated.
+
+ This class can be used by code that wants to treat events in a generic way,
+ without any knowledge of event details. I still need to decide whether such
+ generic code is sensible.
+*/
+class rpl_event_base
+{
+ /*
+ Maybe we will want the ability to materialise an event to a standard
+ binary format. This could be achieved with a base method like this. The
+ actual materialisation would be implemented in each deriving class. The
+ public methods would provide different interfaces for specifying the
+ buffer or for writing directly into IO_CACHE or file.
+ */
+
+ /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
+ int materialise(uchar *buffer, size_t buflen) const;
+ /*
+ Returns NULL on error or else malloc()ed buffer with materialised event,
+ caller must free().
+ */
+ uchar *materialise() const;
+ /* Same but using passed in memroot. */
+ uchar *materialise(mem_root *memroot) const;
+ /*
+ Materialise to user-supplied writer function (could write directly to file
+ or the like).
+ */
+ int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
+
+ /*
+ As to for what to do with a materialised event, there are a couple of
+ possibilities.
+
+ One is to have a de_materialise() method somewhere that can construct an
+ rpl_event_base (really a derived class of course) from a buffer or writer
+ function. This would require each accessor function to conditionally read
+ its data from either THD context or buffer (GCC is able to optimise
+ several such conditionals in multiple accessor function calls into one
+ conditional), or we can make all accessors virtual if the performance hit
+ is acceptable.
+
+ Another is to have different classes for accessing events read from
+ materialised event data.
+
+ Also, I still need to think about whether it is at all useful to be able
+ to generically materialise an event at this level. It may be that any
+    binlog/transport will in any case need to understand more of the format of
+ events, so that such materialisation/transport is better done at a
+ different layer.
+ */
+
+protected:
+ /* Implementation which is the basis for materialise(). */
+ virtual int do_materialise(int (*writer)(uchar *data, size_t len,
+ void *context)) const = 0;
+
+private:
+ /* Virtual base class, private constructor to prevent instantiation. */
+ rpl_event_base();
+};
+
+
+/*
+ These are the event types output from the transaction event generator.
+
+ This generator is not stacked on anything.
+
+ The transaction event generator marks the start and end (commit or rollback)
+ of transactions. It also gives information about whether the transaction was
+ a full transaction or autocommitted statement, whether transactional tables
+ were involved, whether non-transactional tables were involved, and XA
+ information (ToDo).
+*/
+
+/* Base class for transaction events. */
+class rpl_event_transaction_base : public rpl_event_base
+{
+public:
+ /*
+    Get the local transaction id. This id is only unique within one server.
+ It is allocated whenever a new transaction is started.
+ Can be used to identify events belonging to the same transaction in a
+ binlog-like stream of events streamed in parallel among multiple
+ transactions.
+ */
+ uint64_t get_local_trx_id() const { return thd->local_trx_id; };
+
+ bool get_is_autocommit() const;
+
+private:
+ /* The context is the THD. */
+ THD *thd;
+
+ rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
+};
+
+/* Transaction start event. */
+class rpl_event_transaction_start : public rpl_event_transaction_base
+{
+
+};
+
+/* Transaction commit. */
+class rpl_event_transaction_commit : public rpl_event_transaction_base
+{
+public:
+ /*
+ The global transaction id is unique cross-server.
+
+ It can be used to identify the position from which to start a slave
+ replicating from a master.
+
+ This global ID is only available once the transaction is decided to commit
+ by the TC manager / primary redundancy service. This TC also allocates the
+ ID and decides the exact semantics (can there be gaps, etc); however the
+ format is fixed (cluster_id, running_counter).
+ */
+ struct global_transaction_id
+ {
+ uint32_t cluster_id;
+ uint64_t counter;
+ };
+
+ const global_transaction_id *get_global_transaction_id() const;
+};
+
+/* Transaction rollback. */
+class rpl_event_transaction_rollback : public rpl_event_transaction_base
+{
+
+};
+
+
+/* Base class for statement events. */
+class rpl_event_statement_base : public rpl_event_base
+{
+public:
+ LEX_STRING get_current_db() const;
+};
+
+class rpl_event_statement_start : public rpl_event_statement_base
+{
+
+};
+
+class rpl_event_statement_end : public rpl_event_statement_base
+{
+public:
+ int get_errorcode() const;
+};
+
+class rpl_event_statement_query : public rpl_event_statement_base
+{
+public:
+ LEX_STRING get_query_string();
+ ulong get_sql_mode();
+ const CHARSET_INFO *get_character_set_client();
+ const CHARSET_INFO *get_collation_connection();
+ const CHARSET_INFO *get_collation_server();
+ const CHARSET_INFO *get_collation_default_db();
+
+ /*
+ Access to relevant flags that affect query execution.
+
+ Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
+ */
+ enum flag_bits
+ {
+ STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
+ STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
+ STMT_AUTO_IS_NULL, // @@sql_auto_is_null
+ };
+ uint32_t get_flags();
+
+ ulong get_auto_increment_offset();
+ ulong get_auto_increment_increment();
+
+ // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
+ // INSERT_ID; random seed; user variables.
+ //
+ // We probably also need get_uses_temporary_table(), get_used_user_vars(),
+ // get_uses_auto_increment() and so on, so a consumer can get more
+ // information about what kind of context information a query will need when
+ // executed on a slave.
+};
+
+class rpl_event_statement_load_query : public rpl_event_statement_query
+{
+
+};
+
+/*
+ This event is fired with blocks of data for files read (from server-local
+ file or client connection) for LOAD DATA.
+*/
+class rpl_event_statement_load_data_block : public rpl_event_statement_base
+{
+public:
+ struct block
+ {
+ const uchar *ptr;
+ size_t size;
+ };
+ block get_block() const;
+};
+
+/* Base class for row-based replication events. */
+class rpl_event_row_base : public rpl_event_base
+{
+public:
+ /*
+ Access to relevant handler extra flags and other flags that affect row
+ operations.
+
+ Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
+ */
+ enum flag_bits
+ {
+ ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
+ ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
+ ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
+ ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
+ ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
+ };
+ uint32_t get_flags();
+
+ /* Access to list of tables modified. */
+ class table_iterator
+ {
+ public:
+ /* Returns table, NULL after last. */
+ const TABLE *get_next();
+ private:
+ // ...
+ };
+ table_iterator get_modified_tables() const;
+
+private:
+ /* Context used to provide accessors. */
+ THD *thd;
+
+protected:
+ rpl_event_row_base(THD *_thd) : thd(_thd) { }
+};
+
+
+class rpl_event_row_write : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_write_set() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_update : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const BITMAP *get_write_set() const;
+ const uchar *get_before_image() const;
+ const uchar *get_after_image() const;
+};
+
+class rpl_event_row_delete : public rpl_event_row_base
+{
+public:
+ const BITMAP *get_read_set() const;
+ const uchar *get_before_image() const;
+};
+
+
+/*
+ Event consumer callbacks.
+
+ An event consumer registers with an event generator to receive event
+ notifications from that generator.
+
+ The consumer has callbacks (in the form of virtual functions) for the
+ individual event types the consumer is interested in. Only callbacks that
+ are non-NULL will be invoked. If an event applies to multiple callbacks in a
+ single callback struct, it will only be passed to the most specific non-NULL
+ callback (so events never fire more than once per registration).
+
+ The lifetime of the memory holding the event is only for the duration of the
+ callback invocation, unless otherwise noted.
+
+ Callbacks return 0 for success or error code (ToDo: does this make sense?).
+*/
+
+struct rpl_event_consumer_transaction
+{
+ virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
+ virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
+ virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
+};
+
+/*
+ Consuming statement-based events.
+
+ The statement event generator is stacked on top of the transaction event
+ generator, so we can receive those events as well.
+*/
+struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
+{
+ virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
+ virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
+
+ virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
+
+ /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
+ virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
+ { return 0; }
+
+ /*
+ These are specific kinds of statements; if specified they override
+    stmt_query() for the corresponding event.
+ */
+ virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
+ { return stmt_query(ev); }
+};
+
+/*
+ Consuming row-based events.
+
+ The row event generator is stacked on top of the statement event generator.
+*/
+struct rpl_event_consumer_row : public rpl_event_consumer_statement
+{
+ virtual int row_write(const rpl_event_row_write *) { return 0; }
+ virtual int row_update(const rpl_event_row_update *) { return 0; }
+ virtual int row_delete(const rpl_event_row_delete *) { return 0; }
+};
+
+
+/*
+ Registration functions.
+
+ ToDo: Make a way to de-register.
+
+ ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
+ registration method.
+*/
+
+int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
+int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
+int rpl_event_row_register(const rpl_event_consumer_row *cbs);
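
To make the consumer side concrete, here is a small sketch of how a binlog-like
plugin might hook into the draft API above. Only the rpl_event_* names come from
the draft; the consumer class, its internals and the init function are invented
for illustration.

/*
  Illustrative sketch only: a consumer deriving from the draft interfaces.
  Everything outside the rpl_event_* names is hypothetical.
*/
class my_binlog_consumer : public rpl_event_consumer_row
{
public:
  /* Row events handled by the row generator arrive here... */
  virtual int row_write(const rpl_event_row_write *ev)
  {
    const BITMAP *cols= ev->get_write_set();
    const uchar *image= ev->get_after_image();
    /* ... write cols/image to the log (hypothetical helper would go here). */
    return 0;                                   /* 0 = success */
  }

  /* ...while events the row generator defers (eg. DDL) arrive as the
     statement events inherited from rpl_event_consumer_statement. */
  virtual int stmt_query(const rpl_event_statement_query *ev)
  {
    /* Record ev->get_query_string(), sql_mode, charsets, etc. */
    return 0;
  }

  /* Transaction boundaries come from the bottom-most generator. */
  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    const rpl_event_transaction_commit::global_transaction_id *gtid=
      ev->get_global_transaction_id();
    /* (gtid->cluster_id, gtid->counter) marks the commit position. */
    return 0;
  }
};

static my_binlog_consumer my_binlog;

/* Called once at plugin init; registering with the row generator also
   delivers the statement and transaction events it defers to. */
int my_binlog_init()
{
  return rpl_event_row_register(&my_binlog);
}
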
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+An example of an event generator is the execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows retrieving all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+  will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would be separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though with inlining we do not
+get any ABI stability with respect to the THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
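
A small sketch of what this encapsulation could look like in practice. It
assumes the statement event classes carry a THD pointer like the transaction
and row events in the draft do, and the thd->variables member names are the
usual server ones, not part of the proposal.

/* Thin inline accessors: consumers get the data without seeing THD. */
inline ulong rpl_event_statement_query::get_sql_mode()
{
  return thd->variables.sql_mode;
}

inline ulong rpl_event_statement_query::get_auto_increment_increment()
{
  return thd->variables.auto_increment_increment;
}
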
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows retrieving all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
LOW-LEVEL DESIGN:
A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.
There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).
Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).
My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.
-----------------------------------------------------------------------
/*
Virtual base class for generated replication events.
This is the parent of events generated from all kinds of generators. Only
child classes can be instantiated.
This class can be used by code that wants to treat events in a generic way,
without any knowledge of event details. I still need to decide whether such
generic code is sensible.
*/
class rpl_event_base
{
/*
Maybe we will want the ability to materialise an event to a standard
binary format. This could be achieved with a base method like this. The
actual materialisation would be implemented in each deriving class. The
public methods would provide different interfaces for specifying the
buffer or for writing directly into IO_CACHE or file.
*/
/* Return 0 on success, -1 on error, -2 on out-of-buffer. */
int materialise(uchar *buffer, size_t buflen) const;
/*
Returns NULL on error or else malloc()ed buffer with materialised event,
caller must free().
*/
uchar *materialise() const;
/* Same but using passed in memroot. */
uchar *materialise(mem_root *memroot) const;
/*
Materialise to user-supplied writer function (could write directly to file
or the like).
*/
int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
/*
As for what to do with a materialised event, there are a couple of
possibilities.
One is to have a de_materialise() method somewhere that can construct an
rpl_event_base (really a derived class of course) from a buffer or writer
function. This would require each accessor function to conditionally read
its data from either THD context or buffer (GCC is able to optimise
several such conditionals in multiple accessor function calls into one
conditional), or we can make all accessors virtual if the performance hit
is acceptable.
Another is to have different classes for accessing events read from
materialised event data.
Also, I still need to think about whether it is at all useful to be able
to generically materialise an event at this level. It may be that any
binlog/transport will in any case need to understand more of the format of
events, so that such materialisation/transport is better done at a
different layer.
*/
protected:
/* Implementation which is the basis for materialise(). */
virtual int do_materialise(int (*writer)(uchar *data, size_t len,
void *context)) const = 0;
private:
/* Virtual base class, private constructor to prevent instantiation. */
rpl_event_base();
};
/*
These are the event types output from the transaction event generator.
This generator is not stacked on anything.
The transaction event generator marks the start and end (commit or rollback)
of transactions. It also gives information about whether the transaction was
a full transaction or autocommitted statement, whether transactional tables
were involved, whether non-transactional tables were involved, and XA
information (ToDo).
*/
/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
/*
Get the local transaction id. This id is only unique within one server.
It is allocated whenever a new transaction is started.
Can be used to identify events belonging to the same transaction in a
binlog-like stream of events streamed in parallel among multiple
transactions.
*/
uint64_t get_local_trx_id() const { return thd->local_trx_id; };
bool get_is_autocommit() const;
private:
/* The context is the THD. */
THD *thd;
rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};
/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{
};
/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
/*
The global transaction id is unique cross-server.
It can be used to identify the position from which to start a slave
replicating from a master.
This global ID is only available once the transaction is decided to commit
by the TC manager / primary redundancy service. This TC also allocates the
ID and decides the exact semantics (can there be gaps, etc); however the
format is fixed (cluster_id, running_counter).
*/
struct global_transaction_id
{
uint32_t cluster_id;
uint64_t counter;
};
const global_transaction_id *get_global_transaction_id() const;
};
/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{
};
/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
LEX_STRING get_current_db() const;
};
class rpl_event_statement_start : public rpl_event_statement_base
{
};
class rpl_event_statement_end : public rpl_event_statement_base
{
public:
int get_errorcode() const;
};
class rpl_event_statement_query : public rpl_event_statement_base
{
public:
LEX_STRING get_query_string();
ulong get_sql_mode();
const CHARSET_INFO *get_character_set_client();
const CHARSET_INFO *get_collation_connection();
const CHARSET_INFO *get_collation_server();
const CHARSET_INFO *get_collation_default_db();
/*
Access to relevant flags that affect query execution.
Use as if (ev->get_flags() & (uint32)ROW_FOREIGN_KEY_CHECKS) { ... }
*/
enum flag_bits
{
STMT_FOREIGN_KEY_CHECKS, // @@foreign_key_checks
STMT_UNIQUE_KEY_CHECKS, // @@unique_checks
STMT_AUTO_IS_NULL, // @@sql_auto_is_null
};
uint32_t get_flags();
ulong get_auto_increment_offset();
ulong get_auto_increment_increment();
// And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
// INSERT_ID; random seed; user variables.
//
// We probably also need get_uses_temporary_table(), get_used_user_vars(),
// get_uses_auto_increment() and so on, so a consumer can get more
// information about what kind of context information a query will need when
// executed on a slave.
};
class rpl_event_statement_load_query : public rpl_event_statement_query
{
};
/*
This event is fired with blocks of data for files read (from server-local
file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
struct block
{
const uchar *ptr;
size_t size;
};
block get_block() const;
};
/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
/*
Access to relevant handler extra flags and other flags that affect row
operations.
Use as if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
*/
enum flag_bits
{
ROW_WRITE_CAN_REPLACE, // HA_EXTRA_WRITE_CAN_REPLACE
ROW_IGNORE_DUP_KEY, // HA_EXTRA_IGNORE_DUP_KEY
ROW_IGNORE_NO_KEY, // HA_EXTRA_IGNORE_NO_KEY
ROW_DISABLE_FOREIGN_KEY_CHECKS, // ! @@foreign_key_checks
ROW_DISABLE_UNIQUE_KEY_CHECKS, // ! @@unique_checks
};
uint32_t get_flags();
/* Access to list of tables modified. */
class table_iterator
{
public:
/* Returns table, NULL after last. */
const TABLE *get_next();
private:
// ...
};
table_iterator get_modified_tables() const;
private:
/* Context used to provide accessors. */
THD *thd;
protected:
rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
class rpl_event_row_write : public rpl_event_row_base
{
public:
const BITMAP *get_write_set() const;
const uchar *get_after_image() const;
};
class rpl_event_row_update : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const BITMAP *get_write_set() const;
const uchar *get_before_image() const;
const uchar *get_after_image() const;
};
class rpl_event_row_delete : public rpl_event_row_base
{
public:
const BITMAP *get_read_set() const;
const uchar *get_before_image() const;
};
/*
Event consumer callbacks.
An event consumer registers with an event generator to receive event
notifications from that generator.
The consumer has callbacks (in the form of virtual functions) for the
individual event types the consumer is interested in. Only callbacks that
are non-NULL will be invoked. If an event applies to multiple callbacks in a
single callback struct, it will only be passed to the most specific non-NULL
callback (so events never fire more than once per registration).
The lifetime of the memory holding the event is only for the duration of the
callback invocation, unless otherwise noted.
Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/
struct rpl_event_consumer_transaction
{
virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};
/*
Consuming statement-based events.
The statement event generator is stacked on top of the transaction event
generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
virtual int stmt_end(const rpl_event_statement_end *) { return 0; }
virtual int stmt_query(const rpl_event_statement_query *) { return 0; }
/* Data for a file used in LOAD DATA [LOCAL] INFILE. */
virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
{ return 0; }
/*
These are specific kinds of statements; if specified they override
stmt_query() for the corresponding event.
*/
virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
{ return stmt_query(ev); }
};
/*
Consuming row-based events.
The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
virtual int row_write(const rpl_event_row_write *) { return 0; }
virtual int row_update(const rpl_event_row_update *) { return 0; }
virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};
/*
Registration functions.
ToDo: Make a way to de-register.
ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
registration method.
*/
int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
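
As a usage sketch for the buffer-based materialise() above (0 = success,
-1 = error, -2 = out-of-buffer): a consumer could retry with a growing buffer
until the event fits. This assumes the materialise() overloads end up public
as the draft's comment implies; the helper name and the plain malloc()/free()
are not part of the draft, and since the draft does not yet say how the caller
learns the exact materialised length, only the buffer size is returned here.

#include <stdlib.h>

/* Illustrative helper, not part of the proposed API. */
static uchar *materialise_growing(const rpl_event_base *ev, size_t *buflen_out)
{
  size_t len= 256;
  for (;;)
  {
    uchar *buf= (uchar *) malloc(len);
    if (!buf)
      return NULL;                      /* out of memory */
    int res= ev->materialise(buf, len);
    if (res == 0)
    {
      *buflen_out= len;                 /* caller must free() the result */
      return buf;
    }
    free(buf);
    if (res != -2)
      return NULL;                      /* -1: hard error */
    len*= 2;                            /* -2: buffer too small, grow and retry */
  }
}
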
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+An example of an event generator is the execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be deferred to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows retrieving all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+  will thus be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however loses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependence etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of a replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would be separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though with inlining we do not
+get any ABI stability with respect to the THD implementation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
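
To make the non-materialised, accessor-based design above concrete, here is a
sketch of what a generation point inside the server could look like. The
consumer list, the function name and the event construction are invented for
the example (the draft only defines protected base-class constructors, so a
real implementation would add a suitable constructor or factory); only
rpl_event_row_write, row_write() and rpl_event_consumer_row come from the
draft API.

#include <vector>

static std::vector<rpl_event_consumer_row *> row_consumers;

int fire_row_write(THD *thd)
{
  /* The event is only a thin wrapper around the THD context; nothing is
     copied or serialised unless a consumer asks for it via the accessors. */
  const rpl_event_row_write ev(thd);

  for (size_t i= 0; i < row_consumers.size(); i++)
  {
    if (int err= row_consumers[i]->row_write(&ev))
      return err;        /* semantics of consumer errors still to be decided */
  }
  return 0;
}
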
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
An example of an event generator is the execution of SQL statements, which generates
events like those used for statement-based replication. Another example is
PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows retrieving all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.
--- /tmp/wklog.120.old.7341 2010-06-24 14:28:17.000000000 +0000
+++ /tmp/wklog.120.new.7341 2010-06-24 14:28:17.000000000 +0000
@@ -1 +1,159 @@
+Generators and consumers
+-------------------------
+
+We have the two concepts:
+
+1. Event _generators_, that produce events describing all changes to data in a
+ server.
+
+2. Event consumers, that receive such events and use them in various ways.
+
+An example of an event generator is the execution of SQL statements, which generates
+events like those used for statement-based replication. Another example is
+PBXT engine-level replication.
+
+An example of an event consumer is the writing of the binlog on a master.
+
+Event generators are not really plugins. Rather, there are specific points in
+the server where events are generated. However, a generator can be part of a
+plugin, for example a PBXT engine-level replication event generator would be
+part of the PBXT storage engine plugin.
+
+Event consumers on the other hand could be a plugin.
+
+One generator can be stacked on top of another. This means that a generator on
+top (for example row-based events) will handle some events itself
+(eg. non-deterministic update in mixed-mode binlogging). Other events that it
+does not want to or cannot handle (for example deterministic delete or DDL)
+will be defered to the generator below (for example statement-based events).
+
+
+Materialisation (or not)
+------------------------
+
+A central decision is how to represent events that are generated in the API at
+the point of generation.
+
+I want to avoid making the API require that events are materialised. By
+"Materialised" I mean that all (or most) of the data for the event is written
+into memory in a struct/class used inside the server or serialised in a data
+buffer (byte buffer) in a format suitable for network transport or disk
+storage.
+
+Using a non-materialised event means storing just a reference to appropriate
+context that allows to retrieve all information for the event using
+accessors. Ie. typically this would be based on getting the event information
+from the THD pointer.
+
+Some reasons to avoid using materialised events in the API:
+
+ - Replication events have a _lot_ of detailed context information that can be
+ needed in events: user-defined variables, random seed, character sets,
+ table column names and types, etc. etc. If we make the API based on
+ materialisation, then the initial decision about which context information
+ to include with which events will have to be done in the API, while ideally
+ we want this decision to be done by the individual consumer plugin. There
+ will this be a conflict between what to include (to allow consumers access)
+ and what to exclude (to avoid excessive needless work).
+
+ - Materialising means defining a very specific format, which will tend to
+ make the API less generic and flexible.
+
+ - Unless the materialised format is made _very_ specific (and thus very
+ inflexible), it is unlikely to be directly useful for transport
+ (eg. binlog), so it will need to be re-materialised into a different format
+ anyway, wasting work.
+
+ - If a generator on top handles an event, then we want to avoid wasting work
+ materialising an event in a generator below which would be completely
+ unused. Thus there would be a need for the upper generator to somehow
+ notify the lower generator ahead of event generation time to not fire an
+ event, complicating the API.
+
+Some advantages for materialisation:
+
+ - Using an API based on passing around some well-defined struct event (or
+ byte buffer) will be simpler than the complex class hierarchy proposed here
+ with no requirement for materialisation.
+
+ - Defining a materialised format would allow an easy way to use the same
+ consumer code on a generator that produces events at the source of
+ execution and on a generator that produces events from eg. reading them
+ from an event log.
+
+Note that there can be some middle way, where some data is materialised and
+some is kept as reference to context (eg. THD) only. This however looses most
+of the mentioned advantages for materialisation.
+
+The design proposed here aims for as little materialisation as possible.
+
+
+Default materialisation format
+------------------------------
+
+
+While the proposed API doesn't _require_ materialisation, we can still think
+about providing the _option_ for built-in materialisation. This could be
+useful if such materialisation is made suitable for transport to a different
+server (eg. no endian-dependance etc). If there is a facility for such
+materialisation built-in to the API, it becomes possible to write something
+like a generic binlog plugin or generic network transport plugin. This would
+be really useful for eg. PBXT engine-level replication, as it could be
+implemented without having to re-invent a binlog format.
+
+I added in the proposed API a simple facility to materialise every event as a
+string of bytes. To use this, I still need to add a suitable facility to
+de-materialise the event.
+
+However, it is still an open question whether such a facility will be at all
+useful. It still has some of the problems with materialisation mentioned
+above. And I think it is likely that a good binlog implementation will need
+to do more than just blindly copy opaque events from one endpoint to
+another. For example, it might need different event boundaries (merge and/or
+split events); it might need to augment or modify events, or inject new
+events, etc.
+
+So I think maybe it is better to add such a generic materialisation facility
+on top of the basic event generator API. Such a facility would provide
+materialisation of an replication event stream, not of individual events, so
+would be more flexible in providing a good implementation. It would be
+implemented for all generators. It would separate from both the event
+generator API (so we have flexibility to put a filter class in-between
+generator and materialisation), and could also be separate from the actual
+transport handling stuff like fsync() of binlog files and socket connections
+etc. It would be paired with a corresponding applier API which would handle
+executing events on a slave.
+
+Then we can have a default materialised event format, which is available, but
+not mandatory. So there can still be other formats alongside (like legacy
+MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
+
+
+Encapsulation
+-------------
+
+Another fundamental question about the design is the level of encapsulation
+used for the API.
+
+At the implementation level, a lot of the work is basically to pull out all of
+the needed information from the THD object/context. The API I propose tries to
+_not_ expose the THD to consumers. Instead it provides accessor functions for
+all the bits and pieces relevant to each replication event, while the event
+class itself likely will be more or less just an encapsulated THD.
+
+So an alternative would be to have a generic event that was just (type, THD).
+Then consumers could just pull out whatever information they want from the
+THD. The THD implementation is already exposed to storage engines. This would
+of course greatly reduce the size of the API, eliminating lots of class
+definitions and accessor functions. Though arguably it wouldn't really
+simplify the API, as the complexity would just be in understanding the THD
+class.
+
+Note that we do not have to take any performance hit from using encapsulated
+accessors since compilers can inline them (though if inlining then we do not
+get any ABI stability with respect to THD implemetation).
+
+For now, the API is proposed without exposing the THD class. (Similar
+encapsulation could be added in actual implementation to also not expose TABLE
+and similar classes).
-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
HIGH-LEVEL SPECIFICATION:
Generators and consumers
-------------------------
We have the two concepts:
1. Event _generators_, that produce events describing all changes to data in a
server.
2. Event consumers, that receive such events and use them in various ways.
One example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.
An example of an event consumer is the writing of the binlog on a master.
Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.
Event consumers on the other hand could be a plugin.
One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
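As a rough illustration of the stacking idea (hypothetical interface only, not
an existing API), a generator could hold a pointer to the generator below it
and defer the changes it does not want to handle:

/* Rough sketch of generator stacking; names and interface are hypothetical. */
struct change_info
{
  bool is_ddl;
  bool is_deterministic;
  /* ... whatever context the generators need ... */
};

class event_generator
{
public:
  event_generator(event_generator *below) : m_below(below) {}
  virtual ~event_generator() {}

  /* Returns true if this generator produced an event for the change. */
  virtual bool generate(const change_info &chg)= 0;

protected:
  /* Hand the change to the generator below (e.g. statement-based), if any. */
  bool defer(const change_info &chg)
  { return m_below ? m_below->generate(chg) : false; }

private:
  event_generator *m_below;
};

class row_event_generator : public event_generator
{
public:
  row_event_generator(event_generator *below) : event_generator(below) {}

  virtual bool generate(const change_info &chg)
  {
    if (chg.is_ddl || chg.is_deterministic)
      return defer(chg);          /* let statement-based generation handle it */
    /* ... produce row-based event(s) for the change here ... */
    return true;
  }
};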
Materialisation (or not)
------------------------
A central decision is how to represent events that are generated in the API at
the point of generation.
I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.
Using a non-materialised event means storing just a reference to appropriate
context that allows retrieving all information for the event using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
Some reasons to avoid using materialised events in the API:
- Replication events have a _lot_ of detailed context information that can be
needed in events: user-defined variables, random seed, character sets,
table column names and types, etc. etc. If we make the API based on
materialisation, then the initial decision about which context information
to include with which events will have to be done in the API, while ideally
we want this decision to be done by the individual consumer plugin. There
will thus be a conflict between what to include (to allow consumers access)
and what to exclude (to avoid excessive needless work).
- Materialising means defining a very specific format, which will tend to
make the API less generic and flexible.
- Unless the materialised format is made _very_ specific (and thus very
inflexible), it is unlikely to be directly useful for transport
(eg. binlog), so it will need to be re-materialised into a different format
anyway, wasting work.
- If a generator on top handles an event, then we want to avoid wasting work
materialising an event in a generator below which would be completely
unused. Thus there would be a need for the upper generator to somehow
notify the lower generator ahead of event generation time to not fire an
event, complicating the API.
Some advantages for materialisation:
- Using an API based on passing around some well-defined struct event (or
byte buffer) will be simpler than the complex class hierarchy proposed here
with no requirement for materialisation.
- Defining a materialised format would allow an easy way to use the same
consumer code on a generator that produces events at the source of
execution and on a generator that produces events from eg. reading them
from an event log.
Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.
The design proposed here aims for as little materialisation as possible.
Default materialisation format
------------------------------
While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.
I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
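One possible shape for such a per-event materialisation hook is sketched below;
this is purely illustrative, the names are not part of any existing API, and
the de-materialisation side is exactly the part that is still missing:

/* Illustrative sketch only. */
#include <stddef.h>

class rpl_event_base
{
public:
  virtual ~rpl_event_base() {}

  /* Serialise the event into an opaque, endian-neutral byte string that a
     transport (binlog writer, network plugin) could ship as-is; returns the
     number of bytes written, or the required size if buf_size is too small. */
  virtual size_t materialise(unsigned char *buf, size_t buf_size) const= 0;

  /* A matching factory such as
       static rpl_event_base *dematerialise(const unsigned char *buf,
                                            size_t len);
     would be needed on the receiving side. */
};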
However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.
So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).
Encapsulation
-------------
Another fundamental question about the design is the level of encapsulation
used for the API.
At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.
So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.
Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).
For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516 2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516 2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
and maybe reading of events from relay log on slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Alexi): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
------------------------------------------------------------
-=-=(View All Progress Notes, 37 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates
option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events whenever its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a
slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF, so that OFF stays the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
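For example (the binlog file name is just a placeholder):

mysqlbinlog --print-annotate-rows-events master-bin.000001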
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means setting the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Alexi): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
------------------------------------------------------------
-=-=(View All Progress Notes, 37 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates
option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events whenever its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a
slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF, so that OFF stays the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means setting the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Alexi): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 72
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Alexi - Thu, 24 Jun 2010, 09:49)=-=-
Final implementation cleanup, testing, help Percona with build issues related to the WL
Worked 5 hours and estimate 0 hours remain (original estimate increased by 5 hours).
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
------------------------------------------------------------
-=-=(View All Progress Notes, 37 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this is that the binlog can grow to approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. It has an empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of the Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (only sensible together with the log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we need this additional 'replicate' option? Why not make
the slave reproduce these events whenever its global
binlog-annotate-rows-events value is ON? Because, for example, we may want
to configure a slave which reproduces Annotate_rows events but has the
global binlog-annotate-rows-events = OFF as the default value for the
client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
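For example (the binary log file name below is only a placeholder):
mysqlbinlog --print-annotate-rows-events mysql-bin.000001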
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in
the Format description event. (This extension does not affect the
compatibility of the binary log.) Here is how the Format description event
looks with this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member,
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
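For completeness, the matching data-header writer can be trivial because the
post-header is empty. A minimal sketch in the same style as the members above
(the real member may differ in details):
bool write_data_header(IO_CACHE *file)
{
return 0; // nothing to write: the Annotate_rows event has an empty post-header
}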
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command instead of reusing COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
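A rough sketch of the requesting side, assuming the BINLOG_SEND_ANNOTATE_ROWS_EVENT
flag bit from section 6 and the client-protocol helper simple_command() (the exact
place and error handling in the real code may differ):
uchar options_buf[1];
options_buf[0]= opt_replicate_annotate_rows_events ?
BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
if (simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, options_buf, 1, 0))
... // report the error, as for other commands sent to the master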
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means setting the current
thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and then
restored when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may instead keep a pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
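For illustration, the simplest possible free_annotate_event() under this class
sketch, assuming no save/restore of the previous thd query is needed:
void Relay_log_info::free_annotate_event()
{
delete m_annotate_event;
m_annotate_event= 0;
}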
The saved Annotate_rows object should be deleted once all the corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noted that introducing a new log event type should be coordinated
with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals@lists.mysql.com about this, suggesting that the event number
be reserved.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Alexi): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 67
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Alexi - Thu, 24 Jun 2010, 09:47)=-=-
Making rpl- and binlog-tests stable w.r.t. adding new binlog events.
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
------------------------------------------------------------
-=-=(View All Progress Notes, 36 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this is that the binlog can grow to approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. It has an empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of the Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (only sensible together with the log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we need this additional 'replicate' option? Why not make
the slave reproduce these events whenever its global
binlog-annotate-rows-events value is ON? Because, for example, we may want
to configure a slave which reproduces Annotate_rows events but has the
global binlog-annotate-rows-events = OFF as the default value for the
client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in
the Format description event. (This extension does not affect the
compatibility of the binary log.) Here is how the Format description event
looks with this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
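Since the data header is empty, the corresponding write_data_header() is
trivial; a sketch for completeness (the WL text above only says the header is
empty, so this body is an assumption):
bool write_data_header(IO_CACHE *file)
{
  return 0;   // Annotate_rows has an empty post-header, nothing to write
}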
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
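For illustration, a client-side sketch of issuing this command before the dump
request (an assumption, not quoted from the patch; the helper name and the
one-byte payload layout are illustrative):
static int request_annotate_rows_events(MYSQL *mysql, bool annotate)
{
  uchar buf[1];
  /* one flag byte: whether the master should send Annotate_rows events */
  buf[0]= annotate ? BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
  /* simple_command() sends a COM_* packet and reads the master's reply */
  return simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, buf, 1, 0);
}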
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
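The WL text does not show how the flags saved by COM_BINLOG_DUMP_OPTIONS_EXT
reach mysql_binlog_send(); one possible wiring (an assumption only) is to fold
them into 'flags' inside the COM_BINLOG_DUMP case, right after reading them:
/* assumption: merge the per-connection extended flags saved earlier */
flags|= thd->binlog_dump_flags_ext;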
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the current
thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we must not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
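A minimal sketch of these helpers (the WL only declares them; the bodies below
are illustrative):
void Relay_log_info::set_annotate_event(Annotate_rows_log_event *ev)
{
  free_annotate_event();          // drop any previously saved annotation
  m_annotate_event= ev;
}
Annotate_rows_log_event* Relay_log_info::get_annotate_event()
{
  return m_annotate_event;
}
void Relay_log_info::free_annotate_event()
{
  delete m_annotate_event;        // thd->query must no longer point into it
  m_annotate_event= NULL;
}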
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
24 Jun '10
Hi all,
For the next release of 5.2, we wanted to be sure that it comes with the
proper engines. In previous releases it had to be built manually to
include XtraDB, and didn't have FederatedX at all.
I spent some time looking at the plugin loading code, which turned out
to be irrelevant: a standard build of the zip file now actually
has FederatedX and XtraDB in it.
So the Windows code is good to go :-)
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
[Maria-developers] Updated (by Sanja): Subquery optimization: Avoid recalculating subquery if external fields values found in subquery cache (66)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subquery optimization: Avoid recalculating subquery if external fields
values found in subquery cache
CREATION DATE..: Wed, 25 Nov 2009, 22:25
SUPERVISOR.....: Monty
IMPLEMENTOR....: Sanja
COPIES TO......:
CATEGORY.......: Client-Sprint
TASK ID........: 66 (http://askmonty.org/worklog/?tid=66)
VERSION........: 9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sanja - Thu, 24 Jun 2010, 06:01)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.19228 2010-06-24 06:01:35.000000000 +0000
+++ /tmp/wklog.66.new.19228 2010-06-24 06:01:35.000000000 +0000
@@ -1,10 +1,10 @@
-* Target version: base on mysql-5.2 code
+* Target version: base on mysql-5.3 code
All items on which subquery depend could be collected in
-st_select_lex::mark_as_dependent (direct of indirect reference?)
+st_select_lex::register_dependency_item (indirect reference)
Temporary table index should be created by all fields except result field
-(TMP_TABLE_PARAM::keyinfo).
+(TABLE::add_tmp_key).
How to fill the temptable
-------------------------
@@ -49,7 +49,8 @@
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
-The code in this WL will use the same approach
+The code in this WL will use the same approach except eqality which will be
+created according to field type (some types do not need it)
Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -58,3 +59,11 @@
temporary table's field and then subquery_predicate->val_int() will invoke
$I->val_int(), subquery_predicate->val_str() will invoke $I->val_str() and so
forth.
+
+Caching the subquery in Item tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use Item::transform to put caching Item (Item_cache_wrapper) before the
+subquery (Item_subquery* or Item_in_optimizer). For this we add new transformer
+method ::cache_insert_transformer.
+
-=-=(Guest - Sun, 13 Jun 2010, 16:51)=-=-
Dependency deleted: 91 no longer depends on 66
-=-=(Igor - Wed, 10 Mar 2010, 21:29)=-=-
High Level Description modified.
--- /tmp/wklog.66.old.32188 2010-03-10 21:29:16.000000000 +0000
+++ /tmp/wklog.66.new.32188 2010-03-10 21:29:16.000000000 +0000
@@ -1,3 +1,10 @@
+The goal of this task is to optimize evaluation of subqueries and subquery
+predicates by storing the results of a correlated subquery together with
+correlation parameters in a cache and reusing those results for the same sets of
+parameters.
+
+Here's what is to be done in this task in more details:
+
Collect all outer items/references (left part of the subquiery and outer
references inside the subquery) in key string. Compare the string (which
represents certain value set of the references) against values in hash table and
-=-=(Igor - Wed, 10 Mar 2010, 21:13)=-=-
Dependency created: 91 now depends on 66
-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Category updated.
--- /tmp/wklog.66.old.31558 2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558 2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Client-Sprint
-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Version updated.
--- /tmp/wklog.66.old.31558 2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558 2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-5.3
+9.x
-=-=(Monty - Fri, 29 Jan 2010, 19:07)=-=-
Version updated.
--- /tmp/wklog.66.old.5893 2010-01-29 19:07:10.000000000 +0200
+++ /tmp/wklog.66.new.5893 2010-01-29 19:07:10.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
-=-=(Psergey - Wed, 20 Jan 2010, 14:50)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.26873 2010-01-20 14:50:41.000000000 +0200
+++ /tmp/wklog.66.new.26873 2010-01-20 14:50:41.000000000 +0200
@@ -4,7 +4,6 @@
To check/discuss:
-----------------
-* Do we put subquery cache on all levels of subqueries or on highest level only
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
expression cache", grep for left_expr_cache in sql/item_subselect.*
@@ -41,7 +40,12 @@
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or ROW(scalar1,...scalarN) type for scalar-context subquery.
-We dont support cases when outer_expr or correlation_references are blobs.
+We don't support cases when outer_expr or correlation_references are blobs.
+
+All subquery predicates are cached. That is, if one subquery predicate is
+located within another, both of them will have caches (one option to reduce
+cache memory usage was to use cache only for the upper-most select. we decided
+against it).
2. Data structure used for the cache
------------------------------------
-=-=(Psergey - Wed, 20 Jan 2010, 13:07)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.17649 2010-01-20 13:07:07.000000000 +0200
+++ /tmp/wklog.66.new.17649 2010-01-20 13:07:07.000000000 +0200
@@ -3,7 +3,13 @@
To check/discuss:
- To put subquery cache on all levels of subqueries or on highest level only.
+-----------------
+* Do we put subquery cache on all levels of subqueries or on highest level only
+* Will there be any means to measure subquery cache hit rate?
+* MySQL-6.0 has a one-element predicate result cache. It is called "left
+ expression cache", grep for left_expr_cache in sql/item_subselect.*
+ When this WL is merged with 6.0's optimizations, these two caches will
+ need to be unified somehow.
<contents>
-=-=(Psergey - Mon, 18 Jan 2010, 16:40)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.24899 2010-01-18 16:40:16.000000000 +0200
+++ /tmp/wklog.66.new.24899 2010-01-18 16:40:16.000000000 +0200
@@ -1,3 +1,5 @@
+* Target version: base on mysql-5.2 code
+
All items on which subquery depend could be collected in
st_select_lex::mark_as_dependent (direct of indirect reference?)
------------------------------------------------------------
-=-=(View All Progress Notes, 19 total)=-=-
http://askmonty.org/worklog/index.pl?tid=66&nolimit=1
DESCRIPTION:
The goal of this task is to optimize evaluation of subqueries and subquery
predicates by storing the results of a correlated subquery together with
correlation parameters in a cache and reusing those results for the same sets of
parameters.
Here's what is to be done in this task in more details:
Collect all outer items/references (the left part of the subquery and the outer
references inside the subquery) into a key string. Compare the string (which
represents a certain value set of the references) against values in a hash table
and return the cached result of the subquery if this combination of reference
values has already been used.
For example, in the following subquery:
(L1, L2) IN (SELECT A, B FROM T WHERE T.F1 > OUTER_FIELD)
the set of references to look up in the subquery cache is (L1, L2, OUTER_FIELD).
The subquery cache should be implemented as a simple LRU cache attached to the
subquery. The size of the subquery cache (in number of results, or maybe in
used memory) is limited by a session variable (query parameter?).
HIGH-LEVEL SPECIFICATION:
Attach a subquery cache to each Item_subquery. The interface should allow using
either a hash table or a temporary table inside.
To check/discuss:
-----------------
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
expression cache", grep for left_expr_cache in sql/item_subselect.*
When this WL is merged with 6.0's optimizations, these two caches will
need to be unified somehow.
<contents>
1. Scope of the task
2. Data structure used for the cache
3. Cache size
4. Interplay with other subquery optimizations
5. User interface
</contents>
1. Scope of the task
--------------------
This WL should handle all subquery predicates, i.e. it should handle these
cases:
outer_expr IN (SELECT correlated_select)
outer_expr $CMP$ ALL/ANY (SELECT correlated_select)
EXISTS (SELECT correlated_select)
scalar-context subquery: (SELECT correlated_select)
The cache will maintain
(outer_expr, correlation_references)-> subquery_item_result
mapping, where
- correlation_references is a list of tablename.column_name that are referenced
from the correlated_select but where tablename is a table outside the
subquery.
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or ROW(scalar1,...scalarN) type for scalar-context subquery.
We don't support cases when outer_expr or correlation_references are blobs.
All subquery predicates are cached. That is, if one subquery predicate is
located within another, both of them will have caches (one option to reduce
cache memory usage was to use cache only for the upper-most select. we decided
against it).
2. Data structure used for the cache
------------------------------------
There are two data structures available in the codebase that will allow fast
equality lookups:
1. HASH (mysys/hash.c) tables
2. Temporary tables (the ones that are used for e.g. GROUP BY)
None of them has any support for element eviction on overflow (using LRU or
some other policy).
Query cache and MyISAM/Maria's key/page cache ought to support some eviction
mechanism, but code-wise it is not readily reusable; one would need to factor
it out (or copy it).
We choose to use #2, and not to have any eviction policy. See subsequent
sections for details and reasoning behind the decision.
3. Cache size
-------------
Typically, a cache has some maximum size and a policy which is used to
select a cache entry for removal when the cache becomes full (e.g. find
and remove the least [recently] used entry)
For this WL entry we will use a cache of infinite size. The reasoning behind
this is that:
- it is easy to do: we have temporary tables that can grow to arbitrarily
large size while still providing the same insert/lookup interface.
- it suits us: unless the subquery is resolved with one index lookup,
hitting the cache would be many times cheaper than re-running the
subquery, so the cache is worth having.
4. Interplay with other subquery optimizations
----------------------------------------------
* This WL entry should not care about IN->EXISTS transformation: caching for
IN subquery and result of its conversion to EXISTS would work in the same
way.
* This optimization is orthogonal to <=>ANY -> MIN/MAX rewrite (it will
work/be useful irrespectively of whether the rewrite has been performed or
not)
* TODO: compare this with materialization for uncorrelated IN-subqueries. Is
this basically the same?
A: no, it is not:
- IN-Materialization has to perform full materialization before it can
do the first subquery evaluation. This WL's code has almost no startup
costs.
- This optimization has temp.table of (corr_reference, predicate_value),
while IN-materialization will have (corr_reference) only.
5. User interface
-----------------
* There will be an @@optimizer_switch flag to turn this optimization on and
off (TODO: name of the flag?)
* TODO: how do we show this in EXPLAIN [EXTENDED]? The easiest is to
print something in the warning text of EXPLAIN EXTENDED that would indicate
use of the cache.
* temporary table sizing (max size for heap table, whether to use MyISAM or
Maria) will be controlled with common temp.table control variables.
LOW-LEVEL DESIGN:
* Target version: base on mysql-5.3 code
All items on which the subquery depends could be collected in
st_select_lex::register_dependency_item (indirect reference).
The temporary table index should be created over all fields except the result
field (TABLE::add_tmp_key).
How to fill the temptable
-------------------------
Can reuse approach from SJ-Materialization. Its code is in end_sj_materialize()
and is supposed to be quite trivial.
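A sketch of what that could look like here (an assumption modelled on
end_sj_materialize(); 'ref_items' stands for the collected outer expression and
correlation references):
/* Copy current values of the references into the cache temptable and store
   the row; a duplicate key only means this combination is already cached. */
if (fill_record(thd, table->field, ref_items, TRUE))
  DBUG_RETURN(NESTED_LOOP_ERROR);
if ((error= table->file->ha_write_row(table->record[0])) &&
    table->file->is_fatal_error(error, HA_CHECK_DUP))
  DBUG_RETURN(NESTED_LOOP_ERROR);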
How to make lookups into temptable
----------------------------------
We'll reuse approach used by SJ-Materialization in 6.0.
Setup process
~~~~~~~~~~~~~
Setup is performed in the same way as in setup_sj_materialization(),
see the code that starts these lines:
/*
Create/initialize everything we will need to index lookups into the
temptable.
*/
and ends at this line:
Remove the injected semi-join IN-equalities from join_tab conds. This
<questionable>
We'll also need to check equalities, i.e. do an equivalent of this:
if (!(sjm->in_equality= create_subq_in_equalities(thd, sjm,
emb_sj_nest->sj_subq_pred)))
DBUG_RETURN(TRUE); /* purecov: inspected */
Question: or perhaps that is not necessary?
</questionable>
Doing the lookup
~~~~~~~~~~~~~~~~
SJ-Materialization does lookup in sub_select_sjm(), with this code:
/* Do index lookup in the materialized table */
if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
The code in this WL will use the same approach, except for the equality, which
will be created according to the field type (some types do not need it).
Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The goal of making the lookup is to get the value of subquery predicate.
This is done by creating an Item_field $I which refers to appropriate
temporary table's field and then subquery_predicate->val_int() will invoke
$I->val_int(), subquery_predicate->val_str() will invoke $I->val_str() and so
forth.
Caching the subquery in Item tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use Item::transform to put caching Item (Item_cache_wrapper) before the
subquery (Item_subquery* or Item_in_optimizer). For this we add new transformer
method ::cache_insert_transformer.
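A minimal sketch of such a transformer (assumption: the base-class default and
the Item_cache_wrapper constructor arguments are illustrative only):
Item *Item::cache_insert_transformer(uchar *thd_arg)
{
  return this;                          // default: nothing to wrap
}
Item *Item_subselect::cache_insert_transformer(uchar *thd_arg)
{
  return new Item_cache_wrapper(this);  // wrap the subquery predicate
}
/* applied while walking a condition tree: */
cond= cond->transform(&Item::cache_insert_transformer, (uchar*) thd);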
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Sanja): Subquery optimization: Avoid recalculating subquery if external fields values found in subquery cache (66)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Subquery optimization: Avoid recalculating subquery if external fields
values found in subquery cache
CREATION DATE..: Wed, 25 Nov 2009, 22:25
SUPERVISOR.....: Monty
IMPLEMENTOR....: Sanja
COPIES TO......:
CATEGORY.......: Client-Sprint
TASK ID........: 66 (http://askmonty.org/worklog/?tid=66)
VERSION........: 9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Sanja - Thu, 24 Jun 2010, 06:01)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.19228 2010-06-24 06:01:35.000000000 +0000
+++ /tmp/wklog.66.new.19228 2010-06-24 06:01:35.000000000 +0000
@@ -1,10 +1,10 @@
-* Target version: base on mysql-5.2 code
+* Target version: base on mysql-5.3 code
All items on which subquery depend could be collected in
-st_select_lex::mark_as_dependent (direct of indirect reference?)
+st_select_lex::register_dependency_item (indirect reference)
Temporary table index should be created by all fields except result field
-(TMP_TABLE_PARAM::keyinfo).
+(TABLE::add_tmp_key).
How to fill the temptable
-------------------------
@@ -49,7 +49,8 @@
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
-The code in this WL will use the same approach
+The code in this WL will use the same approach except eqality which will be
+created according to field type (some types do not need it)
Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -58,3 +59,11 @@
temporary table's field and then subquery_predicate->val_int() will invoke
$I->val_int(), subquery_predicate->val_str() will invoke $I->val_str() and so
forth.
+
+Caching the subquery in Item tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use Item::transform to put caching Item (Item_cache_wrapper) before the
+subquery (Item_subquery* or Item_in_optimizer). For this we add new transformer
+method ::cache_insert_transformer.
+
-=-=(Guest - Sun, 13 Jun 2010, 16:51)=-=-
Dependency deleted: 91 no longer depends on 66
-=-=(Igor - Wed, 10 Mar 2010, 21:29)=-=-
High Level Description modified.
--- /tmp/wklog.66.old.32188 2010-03-10 21:29:16.000000000 +0000
+++ /tmp/wklog.66.new.32188 2010-03-10 21:29:16.000000000 +0000
@@ -1,3 +1,10 @@
+The goal of this task is to optimize evaluation of subqueries and subquery
+predicates by storing the results of a correlated subquery together with
+correlation parameters in a cache and reusing those results for the same sets of
+parameters.
+
+Here's what is to be done in this task in more details:
+
Collect all outer items/references (left part of the subquiery and outer
references inside the subquery) in key string. Compare the string (which
represents certain value set of the references) against values in hash table and
-=-=(Igor - Wed, 10 Mar 2010, 21:13)=-=-
Dependency created: 91 now depends on 66
-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Category updated.
--- /tmp/wklog.66.old.31558 2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558 2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-BackLog
+Client-Sprint
-=-=(Igor - Wed, 10 Mar 2010, 21:12)=-=-
Version updated.
--- /tmp/wklog.66.old.31558 2010-03-10 21:12:50.000000000 +0000
+++ /tmp/wklog.66.new.31558 2010-03-10 21:12:50.000000000 +0000
@@ -1 +1 @@
-Server-5.3
+9.x
-=-=(Monty - Fri, 29 Jan 2010, 19:07)=-=-
Version updated.
--- /tmp/wklog.66.old.5893 2010-01-29 19:07:10.000000000 +0200
+++ /tmp/wklog.66.new.5893 2010-01-29 19:07:10.000000000 +0200
@@ -1 +1 @@
-Server-5.2
+Server-5.3
-=-=(Psergey - Wed, 20 Jan 2010, 14:50)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.26873 2010-01-20 14:50:41.000000000 +0200
+++ /tmp/wklog.66.new.26873 2010-01-20 14:50:41.000000000 +0200
@@ -4,7 +4,6 @@
To check/discuss:
-----------------
-* Do we put subquery cache on all levels of subqueries or on highest level only
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
expression cache", grep for left_expr_cache in sql/item_subselect.*
@@ -41,7 +40,12 @@
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or ROW(scalar1,...scalarN) type for scalar-context subquery.
-We dont support cases when outer_expr or correlation_references are blobs.
+We don't support cases when outer_expr or correlation_references are blobs.
+
+All subquery predicates are cached. That is, if one subquery predicate is
+located within another, both of them will have caches (one option to reduce
+cache memory usage was to use cache only for the upper-most select. we decided
+against it).
2. Data structure used for the cache
------------------------------------
-=-=(Psergey - Wed, 20 Jan 2010, 13:07)=-=-
High-Level Specification modified.
--- /tmp/wklog.66.old.17649 2010-01-20 13:07:07.000000000 +0200
+++ /tmp/wklog.66.new.17649 2010-01-20 13:07:07.000000000 +0200
@@ -3,7 +3,13 @@
To check/discuss:
- To put subquery cache on all levels of subqueries or on highest level only.
+-----------------
+* Do we put subquery cache on all levels of subqueries or on highest level only
+* Will there be any means to measure subquery cache hit rate?
+* MySQL-6.0 has a one-element predicate result cache. It is called "left
+ expression cache", grep for left_expr_cache in sql/item_subselect.*
+ When this WL is merged with 6.0's optimizations, these two caches will
+ need to be unified somehow.
<contents>
-=-=(Psergey - Mon, 18 Jan 2010, 16:40)=-=-
Low Level Design modified.
--- /tmp/wklog.66.old.24899 2010-01-18 16:40:16.000000000 +0200
+++ /tmp/wklog.66.new.24899 2010-01-18 16:40:16.000000000 +0200
@@ -1,3 +1,5 @@
+* Target version: base on mysql-5.2 code
+
All items on which subquery depend could be collected in
st_select_lex::mark_as_dependent (direct of indirect reference?)
------------------------------------------------------------
-=-=(View All Progress Notes, 19 total)=-=-
http://askmonty.org/worklog/index.pl?tid=66&nolimit=1
DESCRIPTION:
The goal of this task is to optimize evaluation of subqueries and subquery
predicates by storing the results of a correlated subquery together with
correlation parameters in a cache and reusing those results for the same sets of
parameters.
Here's what is to be done in this task in more detail:
Collect all outer items/references (the left part of the subquery and the outer
references inside the subquery) into a key string. Compare the string (which
represents a particular set of values of those references) against the values in
a hash table, and return the cached result of the subquery if this combination
of reference values has already been used.
For example, in the following subquery:
(L1, L2) IN (SELECT A, B FROM T WHERE T.F1>OUTER_FIELD)
the set of references to look up in the subquery cache is (L1, L2, OUTER_FIELD).
The subquery cache should be implemented as a simple LRU cache attached to the
subquery. The size of the subquery cache (in number of results, but maybe in
amount of memory used) is limited by a session variable (query parameter?).
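As a rough, self-contained sketch of the lookup-by-key-string idea (not the
server implementation; SubqueryCache and run_subquery are invented names, and a
std::unordered_map stands in for the hash table or temporary table):
#include <string>
#include <unordered_map>
// Illustrative stand-in for "execute the correlated subquery once".
static bool run_subquery(long l1, long l2, long outer_field)
{
  return l1 + l2 > outer_field;            // placeholder predicate
}
// Hypothetical cache: the key string is built from the outer reference values.
class SubqueryCache
{
  std::unordered_map<std::string, bool> results;
public:
  bool evaluate(long l1, long l2, long outer_field)
  {
    std::string key= std::to_string(l1) + '\0' + std::to_string(l2) + '\0' +
                     std::to_string(outer_field);
    auto it= results.find(key);
    if (it != results.end())
      return it->second;                   // hit: reuse the stored result
    bool res= run_subquery(l1, l2, outer_field);
    results.emplace(std::move(key), res);  // miss: run once and remember
    return res;
  }
};
On a hit the stored result is reused without re-executing the subquery; on a
miss the subquery runs once and the result is remembered under that key.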
HIGH-LEVEL SPECIFICATION:
Attach a subquery cache to each Item_subquery. The interface should allow using
either a hash table or a temporary table underneath.
To check/discuss:
-----------------
* Will there be any means to measure subquery cache hit rate?
* MySQL-6.0 has a one-element predicate result cache. It is called "left
expression cache", grep for left_expr_cache in sql/item_subselect.*
When this WL is merged with 6.0's optimizations, these two caches will
need to be unified somehow.
<contents>
1. Scope of the task
2. Data structure used for the cache
3. Cache size
4. Interplay with other subquery optimizations
5. User interface
</contents>
1. Scope of the task
--------------------
This WL should handle all subquery predicates, i.e. it should handle these
cases:
outer_expr IN (SELECT correlated_select)
outer_expr $CMP$ ALL/ANY (SELECT correlated_select)
EXISTS (SELECT correlated_select)
scalar-context subquery: (SELECT correlated_select)
The cache will maintain
(outer_expr, correlation_references)-> subquery_item_result
mapping, where
- correlation_references is a list of tablename.column_name entries that are
referred to from the correlated_select, where tablename is a table that is
outside the subquery.
- subquery_item_result is 'bool' for subquery predicates, and is of
some scalar or ROW(scalar1,...scalarN) type for scalar-context subquery.
We don't support cases when outer_expr or correlation_references are blobs.
All subquery predicates are cached. That is, if one subquery predicate is
located within another, both of them will have caches (one option to reduce
cache memory usage was to use the cache only for the upper-most select; we
decided against it).
2. Data structure used for the cache
------------------------------------
There are two data structures available in the codebase that will allow fast
equality lookups:
1. HASH (mysys/hash.c) tables
2. Temporary tables (the ones that are used for e.g. GROUP BY)
None of them has any support for element eviction on overflow (using LRU or
some other policy).
The query cache and MyISAM/Maria's key/page cache should have some eviction
mechanism, but code-wise it is not readily reusable; one would need to factor
it out (or copy it).
We choose to use #2, and not to have any eviction policy. See subsequent
sections for details and reasoning behind the decision.
3. Cache size
-------------
Typically, a cache has some maximum size and a policy which is used to
select a cache entry for removal when the cache becomes full (e.g. find
and remove the least [recently] used entry)
For this WL entry we will use a cache of infinite size. The reasoning behind
this is that:
- it is easy to do: we have temporary tables that can grow to an arbitrarily
large size while still providing the same insert/lookup interface.
- it suits us: unless the subquery is resolved with one index lookup,
hitting the cache would be many times cheaper than re-running the
subquery, so the cache is worth having.
4. Interplay with other subquery optimizations
----------------------------------------------
* This WL entry should not care about IN->EXISTS transformation: caching for
IN subquery and result of its conversion to EXISTS would work in the same
way.
* This optimization is orthogonal to <=>ANY -> MIN/MAX rewrite (it will
work/be useful irrespectively of whether the rewrite has been performed or
not)
* TODO: compare this with materialization for uncorrelated IN-subqueries. Is
this basically the same?
A: no, it is not:
- IN-Materialization has to perform full materialization before it can
do the first subquery evaluation. This WL's code has almost no startup
costs.
- This optimization has temp.table of (corr_reference, predicate_value),
while IN-materialization will have (corr_reference) only.
5. User interface
-----------------
* There will be an @@optimizer_switch flag to turn this optimization on and
off (TODO: name of the flag?)
* TODO: how do we show this in EXPLAIN [EXTENDED]? The easiest way is to
print something in the warning text of EXPLAIN EXTENDED that would indicate
use of the cache.
* temporary table sizing (max size for heap table, whether to use MyISAM or
Maria) will be controlled with common temp.table control variables.
LOW-LEVEL DESIGN:
* Target version: base on mysql-5.3 code
All items on which the subquery depends can be collected in
st_select_lex::register_dependency_item (indirect reference).
A temporary table index should be created over all fields except the result field
(TABLE::add_tmp_key).
How to fill the temptable
-------------------------
We can reuse the approach from SJ-Materialization. Its code is in
end_sj_materialize() and should be quite trivial to adapt.
How to make lookups into temptable
----------------------------------
We'll reuse the approach used by SJ-Materialization in 6.0.
Setup process
~~~~~~~~~~~~~
Setup is performed in the same way as in setup_sj_materialization();
see the code that starts with these lines:
/*
Create/initialize everything we will need to index lookups into the
temptable.
*/
and ends at this line:
Remove the injected semi-join IN-equalities from join_tab conds. This
<questionable>
We'll also need to check equalities, i.e. do an equivalent of this:
if (!(sjm->in_equality= create_subq_in_equalities(thd, sjm,
emb_sj_nest->sj_subq_pred)))
DBUG_RETURN(TRUE); /* purecov: inspected */
Question: or perhaps that is not necessary?
</questionable>
Doing the lookup
~~~~~~~~~~~~~~~~
SJ-Materialization does lookup in sub_select_sjm(), with this code:
/* Do index lookup in the materialized table */
if ((res= join_read_key2(join_tab, sjm->table, sjm->tab_ref)) == 1)
DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
if (res || !sjm->in_equality->val_int())
DBUG_RETURN(NESTED_LOOP_NO_MORE_ROWS);
The code in this WL will use the same approach, except that the equality will be
created according to the field type (some types do not need it).
Extracting the value of the subquery predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The goal of making the lookup is to get the value of the subquery predicate.
This is done by creating an Item_field $I which refers to the appropriate
field of the temporary table; then subquery_predicate->val_int() will invoke
$I->val_int(), subquery_predicate->val_str() will invoke $I->val_str(), and so
forth.
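As a stripped-down sketch of this delegation (invented types, not the server's
Item hierarchy; Value plays the role of the temporary table field the lookup
positioned us on, CachedPredicate the role of the wrapped predicate):
#include <string>
struct Value
{
  long long int_val;
  std::string str_val;
};
class CachedPredicate
{
  const Value *field;                       // plays the role of Item_field $I
public:
  explicit CachedPredicate(const Value *f) : field(f) {}
  long long val_int() const { return field->int_val; }     // like $I->val_int()
  std::string val_str() const { return field->str_val; }   // like $I->val_str()
};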
Caching the subquery in Item tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use Item::transform to put a caching Item (Item_cache_wrapper) in front of the
subquery (Item_subquery* or Item_in_optimizer). For this we add a new transformer
method ::cache_insert_transformer.
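A minimal sketch of that kind of tree rewrite, assuming an invented Node
hierarchy instead of the real Item classes (the names below are illustrative
only):
#include <memory>
#include <vector>
struct Node
{
  std::vector<std::unique_ptr<Node>> args;
  virtual ~Node() = default;
  virtual bool is_subquery() const { return false; }
  static std::unique_ptr<Node> cache_insert_transform(std::unique_ptr<Node> n);
};
struct SubqueryNode : Node
{
  bool is_subquery() const override { return true; }
};
struct CacheWrapperNode : Node              // role of Item_cache_wrapper
{
  explicit CacheWrapperNode(std::unique_ptr<Node> inner)
  {
    args.push_back(std::move(inner));
  }
};
// Walk the tree bottom-up and wrap every subquery node in a cache wrapper,
// in the spirit of Item::transform with a cache-inserting transformer.
std::unique_ptr<Node> Node::cache_insert_transform(std::unique_ptr<Node> n)
{
  for (auto &arg : n->args)
    arg= cache_insert_transform(std::move(arg));
  if (n->is_subquery())
    return std::make_unique<CacheWrapperNode>(std::move(n));
  return n;
}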
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month of June 2010 to perform the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned in
the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3)
as the cheapest one, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[This is our strategy of choice at the moment.]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
  temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
  for each record R in scan(idx)  // (INNER-LOOP)
  {
    if (temp_table has R)
      temptable[R].count++;
  }
}
// The following loop can do an ordered or unordered scan;
// if we want an ordered scan, we should probably arrange for the
// 'count' column to be part of the index.
for each record R in temp_table
{
  if (R.count == number_of_streams)
    emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the
temptable[R].count++
operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
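A compact stand-alone sketch of this counting scheme, including the early-exit
improvement (a std::unordered_map plays the role of the heap/MyISAM temp table,
the streams are plain vectors assumed to be duplicate-free, and
intersect_by_counting is an invented name):
#include <cstdint>
#include <unordered_map>
#include <vector>
std::vector<uint64_t>
intersect_by_counting(const std::vector<std::vector<uint64_t>> &streams)
{
  std::vector<uint64_t> result;
  if (streams.empty())
    return result;
  // Seed the temp table from the stream with the fewest rowids.
  size_t smallest= 0;
  for (size_t i= 1; i < streams.size(); i++)
    if (streams[i].size() < streams[smallest].size())
      smallest= i;
  std::unordered_map<uint64_t, size_t> temp_table;   // rowid -> match count
  for (uint64_t rowid : streams[smallest])
    temp_table.emplace(rowid, 1);
  for (size_t i= 0; i < streams.size(); i++)
  {
    if (i == smallest)
      continue;
    size_t bumped= 0;
    for (uint64_t rowid : streams[i])                // (INNER-LOOP)
    {
      auto it= temp_table.find(rowid);
      if (it != temp_table.end())
      {
        it->second++;
        // 1.1.1 improvement: once every temp-table entry has matched this
        // stream, the rest of the stream cannot add anything.
        if (++bumped == temp_table.size())
          break;
      }
    }
  }
  for (const auto &entry : temp_table)
    if (entry.second == streams.size())              // matched in every stream
      result.push_back(entry.first);
  return result;
}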
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean:
- Run over the index, collect rowids in a buffer.
- When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (it is part of filesort's code) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call that reads a rowid from an ordered sequence.
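A self-contained sketch of merging rowid-sorted streams with a priority queue
and keeping only rowids seen in every stream (plain in-memory vectors stand in
for the sorted on-disk runs; this is not the QUICK_ROR_INTERSECT_SELECT code,
and intersect_sorted_streams is an invented name):
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>
std::vector<uint64_t>
intersect_sorted_streams(const std::vector<std::vector<uint64_t>> &streams)
{
  using Entry= std::pair<uint64_t, size_t>;          // (rowid, stream index)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue;
  std::vector<size_t> pos(streams.size(), 0);
  std::vector<uint64_t> result;
  for (size_t i= 0; i < streams.size(); i++)
    if (!streams[i].empty())
      queue.push({streams[i][0], i});
  uint64_t current= 0;
  size_t matches= 0;
  bool first= true;
  while (!queue.empty())
  {
    auto [rowid, i]= queue.top();
    queue.pop();
    if (first || rowid != current)                   // start a new rowid group
    {
      current= rowid;
      matches= 1;
      first= false;
    }
    else
      ++matches;
    if (matches == streams.size())                   // seen in every stream
      result.push_back(rowid);
    if (++pos[i] < streams[i].size())                // advance that stream
      queue.push({streams[i][pos[i]], i});
  }
  return result;
}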
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates rowids
of a single scan and then produces them in sorted order. One could do the same
faster by accumulating an array of rowids and then sorting it.
It's possible to use a Unique object to collect/merge data from all scans,
though. The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead combine the two
    elements by adding their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we're lucky enough to have started and finished a scan on some index (denote
it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means that they don't have a match in S and hence
  will not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
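A minimal sketch of the counter-combining idea, with a std::map standing in for
the Unique's in-memory part (the disk-spill and run-merge machinery of the real
Unique class is omitted; UniqueIntersect is an invented name):
#include <cstdint>
#include <map>
#include <vector>
class UniqueIntersect
{
  std::map<uint64_t, size_t> elements;               // rowid -> n_scans
  size_t n_merged_scans;
public:
  explicit UniqueIntersect(size_t n_scans) : n_merged_scans(n_scans) {}
  void put(uint64_t rowid)                           // once per (scan, rowid)
  {
    elements[rowid]++;                               // combine instead of duplicating
  }
  std::vector<uint64_t> get() const                  // sorted, filtered output
  {
    std::vector<uint64_t> result;
    for (const auto &e : elements)
      if (e.second == n_merged_scans)                // discard partial matches
        result.push_back(e.first);
    return result;
  }
};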
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep these here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages rather than ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-5.3
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Version updated.
--- /tmp/wklog.21.old.18774 2010-06-24 05:49:41.000000000 +0000
+++ /tmp/wklog.21.new.18774 2010-06-24 05:49:41.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 to perform the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned in
the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the cheapest one, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 13 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find a means of computing an intersection of N unordered
streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[This is our strategy of choice at the moment.]
Use a temporary heap table (growing out to MyISAM when it gets too big) with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count int,
  primary key (rowid)
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
  temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
  for each record R in scan(idx)   // (INNER-LOOP)
  {
    if (temp_table has R)
      temp_table[R].count++;
  }
}
// The following loop can do an ordered or unordered scan;
// if we want an ordered scan, we should probably arrange for the
// 'count' column to be part of the index.
for each record R in temp_table
{
  if (R.count == number_of_streams)
    emit(R.rowid);
}
The algorithm has the option of emitting an ordered rowid stream.
In the above form, the cost of producing the first record is high. It is easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as a
temp_table[R].count++
operation leaves the counter equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we have done the
"count++" operation. Once it has been done #records-in-temptable times, no
further record of this scan can produce a new match, so we can finish the
scan, i.e. break out of the INNER-LOOP.
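Below is a minimal C++ sketch of the temptable strategy of 1.1, including the 1.1.1
early exit. It is only an illustration under simplifying assumptions: the heap/MyISAM
temp table is replaced by a std::map, each "scan" is a plain vector of distinct rowids,
and the function and variable names (temptable_intersect, n_matches, ...) are made up
for the example.

#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::string Rowid;

// The std::map stands in for the temp table with PRIMARY KEY(rowid): rowid -> count.
static std::vector<Rowid> temptable_intersect(const std::vector<std::vector<Rowid> > &scans)
{
  std::vector<Rowid> result;
  if (scans.empty())
    return result;
  size_t i1= 0;                                      // i1 = scan expected to return the fewest rows
  for (size_t i= 1; i < scans.size(); i++)
    if (scans[i].size() < scans[i1].size())
      i1= i;
  std::map<Rowid, unsigned> temp_table;
  for (size_t j= 0; j < scans[i1].size(); j++)
    temp_table[scans[i1][j]]= 1;                     // insert(R.rowid, count=1)
  for (size_t i= 0; i < scans.size(); i++)
  {
    if (i == i1)
      continue;
    size_t n_matches= 0;                             // 1.1.1: count the count++ operations
    for (size_t j= 0; j < scans[i].size(); j++)      // INNER-LOOP
    {
      std::map<Rowid, unsigned>::iterator it= temp_table.find(scans[i][j]);
      if (it != temp_table.end())
      {
        it->second++;
        if (++n_matches == temp_table.size())        // every temp-table row already matched:
          break;                                     // nothing left to gain from this scan
      }
    }
  }
  for (std::map<Rowid, unsigned>::const_iterator it= temp_table.begin();
       it != temp_table.end(); ++it)
    if (it->second == scans.size())                  // matched by every merged scan
      result.push_back(it->first);
  return result;
}

int main()
{
  const char *a[]= {"r1", "r5", "r9"}, *b[]= {"r11", "r5", "r9"}, *c[]= {"r2", "r5", "r9"};
  std::vector<std::vector<Rowid> > scans;
  scans.push_back(std::vector<Rowid>(a, a + 3));
  scans.push_back(std::vector<Rowid>(b, b + 3));
  scans.push_back(std::vector<Rowid>(c, c + 3));
  std::cout << temptable_intersect(scans).size() << " rowids in the intersection\n";  // prints 2
  return 0;
}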
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids, then merge the ordered streams.
By a filesort-like action we mean:
- run over the index, collecting rowids in a buffer;
- when the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort's machinery) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to replace the
child_quick->get_next()
call with a call that reads the next rowid from an ordered sequence.
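As a sketch of the merge step, here is a self-contained C++ example of a priority-queue
k-way merge that also counts how many streams contain each rowid, emitting only the rowids
present in all of them. It is not the QUICK_ROR_INTERSECT_SELECT code: each "run" is just
an in-memory sorted vector of distinct rowids, and the names (merge_intersect, QueueElem)
are invented for the example.

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::string Rowid;
typedef std::pair<Rowid, size_t> QueueElem;          // (next rowid, index of its run)

static std::vector<Rowid> merge_intersect(const std::vector<std::vector<Rowid> > &runs)
{
  std::priority_queue<QueueElem, std::vector<QueueElem>, std::greater<QueueElem> > pq;
  std::vector<size_t> pos(runs.size(), 0);
  for (size_t i= 0; i < runs.size(); i++)
    if (!runs[i].empty())
      pq.push(QueueElem(runs[i][0], i));
  std::vector<Rowid> result;
  while (!pq.empty())
  {
    Rowid current= pq.top().first;
    size_t times_seen= 0;
    while (!pq.empty() && pq.top().first == current)  // pop every run positioned on 'current'
    {
      size_t run= pq.top().second;
      pq.pop();
      times_seen++;
      if (++pos[run] < runs[run].size())              // advance that run and re-insert it
        pq.push(QueueElem(runs[run][pos[run]], run));
    }
    if (times_seen == runs.size())                    // rowid present in all merged streams
      result.push_back(current);                      // output comes out rowid-ordered
  }
  return result;
}

int main()
{
  const char *a[]= {"r1", "r4", "r8"}, *b[]= {"r4", "r8"}, *c[]= {"r3", "r4", "r8"};
  std::vector<std::vector<Rowid> > runs;
  runs.push_back(std::vector<Rowid>(a, a + 3));
  runs.push_back(std::vector<Rowid>(b, b + 2));
  runs.push_back(std::vector<Rowid>(c, c + 3));
  std::vector<Rowid> r= merge_intersect(runs);
  for (size_t i= 0; i < r.size(); i++)
    std::cout << r[i] << "\n";                        // prints r4 and r8
  return 0;
}

In the real implementation the runs would be read from the sorted temporary files
produced by the filesort-like step rather than from in-memory vectors.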
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the rowids of
a single scan and then produces them in sorted order: one could do the same faster
by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two elements
    by adding up their n_scans values
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index (denote it
S) without flushing the Unique in the process, then:
- there is no point in adding any new records to the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection;
- we only need to update the counters to be able to tell whether the elements that
  are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it doesn't (skip it).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "lossy" when they exceed a certain size by remembering
disk pages rather than IDs of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems on the
optimization side are:
- Cost formula(s).
- When N keys/conditions are present:
  "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  somehow avoid considering all (2^N - N) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
  = Don't generate a merge of indexes (I_1, ..., I_n) where the columns of I_n are
    a subset of the columns covered by all the other indexes (see the sketch below).
  = (TODO any other rules?)
- Correlation across selectivities. If there is a condition
  "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  can we consider satisfaction of the AND-parts to be independent?
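The subset-pruning rule above can be expressed as a small predicate. The C++ sketch
below is only an illustration of the plan-space problem: it brute-forces all 2^N index
subsets, which is exactly the kind of enumeration the optimizer should avoid for large N.
The column-set model, index contents and names (has_redundant_index, ...) are made up
for the example.

#include <iostream>
#include <set>
#include <string>
#include <vector>

typedef std::set<std::string> Columns;               // an index == the set of columns it covers

// True if some picked index covers only columns that the other picked indexes
// already cover; such a merge is one of the "apparently suboptimal" plans.
static bool has_redundant_index(const std::vector<Columns> &pick)
{
  for (size_t i= 0; i < pick.size(); i++)
  {
    Columns others;
    for (size_t j= 0; j < pick.size(); j++)
      if (j != i)
        others.insert(pick[j].begin(), pick[j].end());
    bool is_subset= true;
    for (Columns::const_iterator c= pick[i].begin(); c != pick[i].end(); ++c)
      if (others.find(*c) == others.end())
      {
        is_subset= false;
        break;
      }
    if (is_subset)
      return true;
  }
  return false;
}

int main()
{
  std::vector<Columns> idx(3);
  idx[0].insert("a"); idx[0].insert("b");
  idx[1].insert("b"); idx[1].insert("c");
  idx[2].insert("a"); idx[2].insert("c");
  size_t n= idx.size(), considered= 0, pruned= 0;
  for (size_t mask= 1; mask < (size_t(1) << n); mask++)  // brute-force 2^N enumeration
  {
    std::vector<Columns> pick;
    for (size_t i= 0; i < n; i++)
      if (mask & (size_t(1) << i))
        pick.push_back(idx[i]);
    if (pick.size() < 2)                              // single-index plans are not merges
      continue;
    considered++;
    if (has_redundant_index(pick))
      pruned++;
  }
  std::cout << considered << " candidate merges, " << pruned << " pruned\n";  // 4 and 1
  return 0;
}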
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 on the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned
in the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class
(1.3) as the cheapest one, requiring the least development effort/time.
2. Developed a design for a modification of the Unique class to support the index intersection
operation.
3. Modified the merge_buffers procedure used by the Unique class so that it can be used not only
for the union operation but for the intersect operation as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 12 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find a means of computing the intersection of N unordered
streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count int,
  primary key(rowid)
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each record R in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do an ordered or unordered scan;
// if we want it to be an ordered scan, we should probably arrange for the
// 'count' column to be part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost of producing the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as a
temptable[R].count++
operation results in the counter becoming equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we have done the
"count++" operation. Once it has been done #records-in-temptable times, no
further record can produce a new match, so we can finish the scan, i.e. break
out of the INNER-LOOP.
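For illustration only, here is a standalone C++ sketch of the temptable strategy
(1.1) together with the early-termination improvement (1.1.1). It is not server
code: a std::unordered_map stands in for the heap-grow-out-to-myisam temporary
table, rowids are plain integers, the input vectors stand in for the range/index
scans, and each rowid is assumed to appear at most once per scan.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Toy stand-in for the temp table: maps rowid -> number of scans that matched it.
std::vector<uint64_t> temptable_intersect(
    const std::vector<std::vector<uint64_t> > &scans)   // scans[0] = cheapest scan
{
  std::unordered_map<uint64_t, size_t> temp;
  for (uint64_t rowid : scans[0])
    temp.emplace(rowid, 1);                             // temp_table.insert(R.rowid, count=1)

  for (size_t i= 1; i < scans.size(); i++)              // remaining index scans
  {
    size_t increments= 0;
    for (uint64_t rowid : scans[i])                     // INNER-LOOP
    {
      auto it= temp.find(rowid);
      if (it == temp.end())
        continue;
      it->second++;                                     // temptable[R].count++
      // 1.1.1 improvement: after #records-in-temptable increments, no further
      // record of this scan can add a match, so break out of the INNER-LOOP.
      if (++increments == temp.size())
        break;
    }
  }

  std::vector<uint64_t> result;                         // unordered rowid stream
  for (const auto &entry : temp)
    if (entry.second == scans.size())                   // matched in every scan
      result.push_back(entry.first);
  return result;
}

int main()
{
  std::vector<std::vector<uint64_t> > scans= {{10, 3, 7, 42},
                                              {42, 5, 3, 10, 8},
                                              {3, 42, 10}};
  for (uint64_t rowid : temptable_intersect(scans))
    std::cout << rowid << "\n";                         // 3, 10, 42 in some order
}

The final filtering pass is still needed even with the early break, because rows
inserted from the first scan may never be matched by the later ones.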
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By a filesort-like action we mean:
- Run over the index, collecting rowids in a buffer.
- When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call that reads a rowid from an ordered sequence.
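A minimal standalone sketch of this strategy, assuming the filesort-like step has
already produced one fully sorted, duplicate-free rowid sequence per scan (plain
sorted vectors below, with no on-disk part). The priority-queue loop plays the
role of the QUICK_ROR_INTERSECT_SELECT merge, with reads from the sorted
sequences in place of the child_quick->get_next() calls; all names are invented
for the sketch.

#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// Merge N sorted rowid sequences; a rowid delivered by all N of them belongs
// to the intersection. The output is itself an ordered rowid stream.
std::vector<uint64_t> merge_sorted_intersect(
    const std::vector<std::vector<uint64_t> > &streams)
{
  typedef std::pair<uint64_t, size_t> Entry;            // (rowid, stream number)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > queue;
  std::vector<size_t> pos(streams.size(), 0);

  for (size_t s= 0; s < streams.size(); s++)
    if (!streams[s].empty())
      queue.push(Entry(streams[s][0], s));

  std::vector<uint64_t> result;
  uint64_t current= 0;
  size_t seen_in= 0;                                    // streams that delivered 'current'
  while (!queue.empty())
  {
    Entry top= queue.top();                             // smallest pending rowid
    queue.pop();
    if (seen_in == 0 || top.first != current)
    {
      current= top.first;                               // start counting a new rowid
      seen_in= 1;
    }
    else
      seen_in++;
    if (seen_in == streams.size())
      result.push_back(current);                        // present in every stream
    if (++pos[top.second] < streams[top.second].size()) // advance that stream
      queue.push(Entry(streams[top.second][pos[top.second]], top.second));
  }
  return result;
}

int main()
{
  std::vector<std::vector<uint64_t> > streams= {{1, 3, 5, 9},
                                                {3, 5, 7, 9},
                                                {3, 4, 5, 9}};
  for (uint64_t rowid : merge_sorted_intersect(streams))
    std::cout << rowid << "\n";                         // 3, 5, 9
}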
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order: one could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs.
- Duplicates are pairs with the same rowid.
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
elements by adding their n_scans values;
  - combine duplicates when it sees them in the Unique.get() call.
- The data we get from the Unique.get() call should be filtered: all records
whose n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
absence from the Unique means that they have no match in S and hence will
not get into the result of the intersection;
- we only need to update the counters to be able to tell whether the elements
that are already in the Unique will have matches in all scans.
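To make the bookkeeping concrete, here is a toy C++ model of such an extended
Unique, under the assumption that everything fits in memory (the real Unique
also flushes sorted chunks to disk and merges them on get(), which the toy
skips). The class and method names are invented for the sketch and are not the
server's Unique API.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Toy model of the extended Unique: the in-memory tree maps rowid -> n_scans,
// so adding an already present rowid never creates a duplicate pair; the two
// elements are combined by adding their n_scans values.
class Intersect_unique
{
public:
  explicit Intersect_unique(size_t n_scans_to_merge)
    : n_scans_to_merge(n_scans_to_merge) {}

  void add(uint64_t rowid) { tree[rowid]++; }

  // get(): walk the tree in rowid order and keep only the elements whose
  // counter shows a match in every merged scan.
  std::vector<uint64_t> get() const
  {
    std::vector<uint64_t> out;
    for (const auto &entry : tree)
      if (entry.second == n_scans_to_merge)
        out.push_back(entry.first);                     // already rowid-ordered
    return out;
  }

private:
  size_t n_scans_to_merge;
  std::map<uint64_t, size_t> tree;                      // rowid -> n_scans
};

int main()
{
  std::vector<uint64_t> scan1= {11, 4, 7}, scan2= {7, 11, 20};
  Intersect_unique unique(2);                           // merging two scans
  for (uint64_t rowid : scan1) unique.add(rowid);
  for (uint64_t rowid : scan2) unique.add(rowid);
  for (uint64_t rowid : unique.get())
    std::cout << rowid << "\n";                         // 7, 11
}

The "lucky" case above would correspond to switching add() into a counting-only
mode for already-present rowids once some scan has been completed without a
flush; the toy omits that, since it never flushes.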
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "lossy" when they exceed a certain size by remembering
disk pages rather than ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of the AND-parts to be independent? (A small
numeric sketch of what this assumption means follows below.)
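As a small numeric sketch of that last point (all numbers invented): under the
independence assumption the combined selectivity is the product of the
per-condition selectivities, and correlated conditions can make that estimate
wrong by an order of magnitude or more.

#include <cstdio>

int main()
{
  double table_rows= 1e6;     // invented numbers, for illustration only
  double sel_key1=   0.05;    // fraction of rows satisfying cond(key1)
  double sel_key2=   0.10;    // fraction of rows satisfying cond(key2)

  // Independence assumption: sel(cond1 AND cond2) = sel(cond1) * sel(cond2).
  double independent_rows= sel_key1 * sel_key2 * table_rows;
  std::printf("independent estimate:  %.0f rows\n", independent_rows);    // 5000

  // If cond(key2) holds for nearly every row that satisfies cond(key1)
  // (strong positive correlation), the true size approaches
  // min(sel_key1, sel_key2) * table_rows, i.e. 10x the estimate here.
  double correlated_rows= (sel_key1 < sel_key2 ? sel_key1 : sel_key2) * table_rows;
  std::printf("fully correlated case: %.0f rows\n", correlated_rows);     // 50000
  return 0;
}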
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Monty
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:49)=-=-
Supervisor updated.
--- /tmp/wklog.21.old.18770 2010-06-24 05:49:14.000000000 +0000
+++ /tmp/wklog.21.new.18770 2010-06-24 05:49:14.000000000 +0000
@@ -1 +1 @@
-Knielsen
+Monty
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
------------------------------------------------------------
-=-=(View All Progress Notes, 12 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "lossy" when they exceed a certain size, by
remembering disk pages rather than the ids of individual records. It's hard
for us to do something similar because our rowids are opaque entities whose
meaning depends on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems on the
optimization side are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
  "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  can we consider satisfaction of the AND-parts to be independent? (See the
  sketch below.)
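For reference, the independence assumption in the last bullet amounts to the
following (a sketch, not a formula taken from the server): with a table of N
rows and merged scans expected to return rows(1), ..., rows(k) records, the
expected size of the intersection is N * (rows(1)/N) * ... * (rows(k)/N).
Correlated conditions can make the true number much larger, which is why the
question matters for the cost formula. A tiny illustrative C++ helper:

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // E(#rows in intersection) under the independence assumption:
  // N * prod(rows(i) / N) over all merged scans.
  double estimated_intersect_rows(double table_rows,
                                  const std::vector<double> &scan_rows)
  {
    double est= table_rows;
    for (size_t i= 0; i < scan_rows.size(); i++)
      est*= scan_rows[i] / table_rows;
    return est;
  }

  int main()
  {
    // Hypothetical numbers: a 1,000,000-row table, two merged scans expected
    // to return 10,000 and 50,000 rows.
    std::vector<double> scans;
    scans.push_back(10000); scans.push_back(50000);
    std::cout << estimated_intersect_rows(1000000, scans) << "\n";  // 500
    return 0;
  }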
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-Sprint
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Category updated.
--- /tmp/wklog.21.old.18765 2010-06-24 05:48:53.000000000 +0000
+++ /tmp/wklog.21.new.18765 2010-06-24 05:48:53.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
------------------------------------------------------------
-=-=(View All Progress Notes, 11 total)=-=-
http://askmonty.org/worklog/index.pl?tid=21&nolimit=1
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
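As an illustration only, here is a minimal C++ sketch of the counting strategy
from 1.1, including the 1.1.1 early exit. This is not the server
implementation: rowids are modelled as plain uint64_t values, each merged scan
as an in-memory vector, and the temp table as a hash map, whereas the real
design uses a heap-grow-out-to-myisam temporary table and opaque,
engine-specific rowids. It assumes a rowid appears at most once per scan,
which is also what makes the early exit valid.

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  // Sketch of section 1.1: count, per rowid, how many merged scans contain it.
  std::vector<uint64_t> intersect_by_counting(
      const std::vector<std::vector<uint64_t>> &scans)
  {
    std::vector<uint64_t> result;
    if (scans.empty())
      return result;

    // scans[0] plays the role of i1, the index with the least E(#records).
    std::unordered_map<uint64_t, size_t> counts;
    for (uint64_t rowid : scans[0])
      counts.emplace(rowid, 1);               // temp_table.insert(R.rowid, count=1)

    for (size_t i= 1; i < scans.size(); i++)  // every index except i1
    {
      size_t incremented= 0;                  // 1.1.1: count++ ops done in this scan
      for (uint64_t rowid : scans[i])         // INNER-LOOP
      {
        auto it= counts.find(rowid);
        if (it == counts.end())               // "if (temp_table has R)"
          continue;
        it->second++;                         // temptable[R].count++
        if (++incremented == counts.size())   // every temp row already matched
          break;                              // 1.1.1 early exit
      }
    }

    // Unordered final scan; sorting 'result' would give the ordered variant.
    for (const auto &entry : counts)
      if (entry.second == scans.size())       // R.count == number_of_streams
        result.push_back(entry.first);
    return result;
  }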
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.

By a filesort-like action we mean:
 - Run over the index, collecting rowids in a buffer.
 - When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort) to produce one ordered
sequence (i.e. an array, which may be partially on disk) of rowids.

Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the

  child_quick->get_next()

call with a call that reads a rowid from an ordered sequence.
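Again purely as a sketch, and under the same assumption that rowids can be
modelled as uint64_t values held in memory: each "run" below stands for the
rowids collected from one scan, std::sort() stands in for the filesort-like
step, and the priority-queue loop plays the role that the mergebuff()-style /
QUICK_ROR_INTERSECT_SELECT-style merging would play in the server.

  #include <algorithm>
  #include <cstdint>
  #include <queue>
  #include <utility>
  #include <vector>

  // Sketch of section 1.2: sort each run, then k-way merge and keep rowids
  // that are present in every run (assuming no duplicates within a run).
  std::vector<uint64_t> intersect_sorted_runs(
      std::vector<std::vector<uint64_t>> runs)
  {
    std::vector<uint64_t> result;
    if (runs.empty())
      return result;

    for (auto &run : runs)
      std::sort(run.begin(), run.end());      // per-scan "filesort"

    typedef std::pair<uint64_t, size_t> Elem; // (current rowid, run index)
    std::priority_queue<Elem, std::vector<Elem>, std::greater<Elem>> pq;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t i= 0; i < runs.size(); i++)
      if (!runs[i].empty())
        pq.push(Elem(runs[i][0], i));

    uint64_t current= 0;
    size_t matches= 0;
    while (!pq.empty())
    {
      Elem top= pq.top();
      pq.pop();
      if (matches == 0 || top.first != current)
      {
        current= top.first;                   // start counting a new rowid
        matches= 1;
      }
      else
        matches++;
      if (matches == runs.size())             // present in every run
        result.push_back(current);
      size_t run= top.second;
      if (++pos[run] < runs[run].size())
        pq.push(Elem(runs[run][pos[run]], run));  // advance that run
    }
    return result;
  }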
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order. One could do
the same faster by accumulating an array of rowids and then sorting it.

It's possible to use a Unique object to collect/merge data from all scans,
though. The idea is as follows:

- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding their n_scans counters.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.

If we're lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
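The following is a rough, purely in-memory sketch of the <rowid, n_scans>
idea; the class and member names are invented for illustration. A std::map
plays the role of the Unique's in-memory tree, so duplicates are combined on
insert, and get() applies the n_scans == number_of_scans_being_merged filter.
The real Unique class periodically flushes its tree to disk, so the duplicate
combining would also have to happen in its merge phase, which this sketch
ignores.

  #include <cstdint>
  #include <map>
  #include <vector>

  // Sketch of section 1.3: a Unique-like accumulator that stores
  // <rowid, n_scans> pairs and combines duplicates instead of keeping them.
  class IntersectUnique
  {
  public:
    explicit IntersectUnique(size_t n_scans) : n_scans_(n_scans) {}

    // "don't add a duplicate into the in-memory part; instead combine the two
    //  elements by adding their n_scans counters"
    void add(uint64_t rowid) { pairs_[rowid]++; }

    // Filtered equivalent of Unique.get(): keep only rowids matched by every
    // merged scan; std::map iteration already yields them in rowid order.
    std::vector<uint64_t> get() const
    {
      std::vector<uint64_t> result;
      for (const auto &p : pairs_)
        if (p.second == n_scans_)
          result.push_back(p.first);
      return result;
    }

  private:
    size_t n_scans_;
    std::map<uint64_t, size_t> pairs_;        // rowid -> number of scans seen in
  };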
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it doesn't (skip it).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages rather than the ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimization are:

- Cost formula(s)
- When N keys/conditions are present:

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  somehow avoid considering the (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
  = Don't generate a merge of indexes (I_1, ... I_n) where the columns of I_n
    are a subset of the columns covered by all other indexes.
  = (TODO: any other rules?)
- Correlation across selectivities. If there is a condition

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  can we consider satisfaction of the AND-parts to be independent?
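To make the last question concrete, here is a tiny, self-contained
illustration (with made-up numbers) of what the independence assumption buys:
if the AND-parts are treated as independent, the combined selectivity is just
the product of the per-key selectivities. With correlated conditions the true
row count can be far larger or smaller than this estimate, which is exactly
the risk the question points at.

  #include <cstdio>
  #include <vector>

  // Estimated row count under the assumption that cond(key_1)..cond(key_N)
  // filter independently: multiply the per-condition selectivities.
  static double independent_rows_estimate(double table_rows,
                                          const std::vector<double> &sel)
  {
    double fraction= 1.0;
    for (double s : sel)
      fraction*= s;                           // independence assumption
    return table_rows * fraction;
  }

  int main()
  {
    // Hypothetical numbers: 1M rows, three conditions each keeping 1% of rows.
    printf("%g\n", independent_rows_estimate(1e6, {0.01, 0.01, 0.01}));  // 1
    return 0;
  }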
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 to perform the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned in
the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the cheapest, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:
  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx) // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or unordered scan;
  // if we want it to be an ordered scan, we had probably better arrange
  // for the 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as a

  temptable[R].count++

operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.

By a filesort-like action we mean:
 - Run over the index, collecting rowids in a buffer.
 - When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort) to produce one ordered
sequence (i.e. an array, which may be partially on disk) of rowids.

Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the

  child_quick->get_next()

call with a call that reads a rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order. One could do
the same faster by accumulating an array of rowids and then sorting it.

It's possible to use a Unique object to collect/merge data from all scans,
though. The idea is as follows:

- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding their n_scans counters.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.

If we're lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it doesn't (skip it).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages rather than the ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimization are:

- Cost formula(s)
- When N keys/conditions are present:

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  somehow avoid considering the (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
  = Don't generate a merge of indexes (I_1, ... I_n) where the columns of I_n
    are a subset of the columns covered by all other indexes.
  = (TODO: any other rules?)
- Correlation across selectivities. If there is a condition

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  can we consider satisfaction of the AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scan, use filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point to use Unique object as a device that accumulates rowids of
a single scan then produces them in sorted order. One could do the same faster
with accumulating an array of rowids and then sorting it.
It's possible to use Unique object to collect/merge data from all scans though.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
keeping them here so we don't consider them over and over
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by
remembering disk pages rather than ids of individual records. It's hard for
us to do something similar because our rowids are opaque entities whose
meaning depends on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
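To make the cost/selectivity part of this list concrete, here is one possible
first-cut estimate for a sort-intersect plan under the independence assumption
questioned in the last item. It is only an illustration with made-up names and
a placeholder per-rowid constant, not the optimizer's actual formula.

  #include <vector>

  struct Scan_estimate
  {
    double read_cost;     // cost of the range scan over this index
    double selectivity;   // E(#records matching cond(key_i)) / #records in table
  };

  /*
    First-cut rows/cost estimate for intersecting N non-ROR scans, assuming the
    AND-parts are satisfied independently (so selectivities simply multiply).
  */
  void estimate_sort_intersect(const std::vector<Scan_estimate> &scans,
                               double table_records,
                               double *out_rows, double *out_cost)
  {
    double rows= table_records;
    double cost= 0.0;
    double rowids_processed= 0.0;        // rowids pushed through the sort/merge step
    for (const Scan_estimate &s : scans)
    {
      rows*= s.selectivity;              // independence assumption
      cost+= s.read_cost;
      rowids_processed+= s.selectivity * table_records;
    }
    cost+= 0.01 * rowids_processed;      // placeholder sort/merge cost per rowid
    *out_rows= rows;
    *out_cost= cost;
  }

Whether multiplying the selectivities is acceptable is exactly the open
question above; any correlation between the AND-parts makes this estimate
optimistic.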
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Igor
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Status updated.
--- /tmp/wklog.21.old.18761 2010-06-24 05:48:43.000000000 +0000
+++ /tmp/wklog.21.new.18761 2010-06-24 05:48:43.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 to perform the following work for this task.
1. Compared three possible algorithms for implementing the operation of index intersection mentioned in
the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the cheapest one, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
  into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
  are already rowid-ordered
- startup cost (the cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:
  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx)  // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or unordered scan;
  // if we want it to be an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the
temptable[R].count++
operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we've done the
"count++" operation. Once it has been done #records-in-temptable times, none
of the remaining records can have a match, so we can finish the scan, i.e.
break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By a filesort-like action we mean:
- Run over the index, collecting rowids in a buffer.
- When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort) to produce one ordered
sequence of rowids (i.e. an array, which may be partially on disk).
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call that reads a rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order. One could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all
scans. The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead combine the two
    elements by adding their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we're lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read
it, and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it doesn't (skip it).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by
remembering disk pages rather than ids of individual records. It's hard for
us to do something similar because our rowids are opaque entities whose
meaning depends on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment, index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation: an index_merge/intersect plan can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ...), and the equalities must cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find a means to compute the intersection of N unordered
rowid streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:
  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx) // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or unordered scan;
  // if we want it to be an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the

  temptable[R].count++

operation results in the counter being equal to the number of merged scans.
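As an illustration only, here is a minimal, self-contained C++ sketch of the
counting scheme above, with std::unordered_map standing in for the
heap-grow-out-to-myisam temporary table; the Rowid type, the function name and
the sample data are invented for this example and are not taken from the
server code.

  // Sketch of the temptable counting intersection (illustrative names only).
  #include <iostream>
  #include <string>
  #include <unordered_map>
  #include <vector>

  using Rowid = std::string;

  // streams[0] plays the role of range_scan(i1); the rest are the other index scans.
  std::vector<Rowid> temptable_intersect(const std::vector<std::vector<Rowid>> &streams)
  {
    std::unordered_map<Rowid, size_t> temp_table;        // rowid -> count
    for (const Rowid &r : streams[0])
      temp_table.emplace(r, 1);                          // insert with count=1

    for (size_t i = 1; i < streams.size(); i++)          // the INNER-LOOP scans
      for (const Rowid &r : streams[i])
      {
        auto it = temp_table.find(r);
        if (it != temp_table.end())
          it->second++;                                  // temptable[R].count++
      }

    std::vector<Rowid> result;                           // final (unordered) scan
    for (const auto &entry : temp_table)
      if (entry.second == streams.size())                // R.count == number_of_streams
        result.push_back(entry.first);
    return result;
  }

  int main()
  {
    std::vector<std::vector<Rowid>> streams = {{"r1", "r2", "r3"},
                                               {"r2", "r3", "r4"},
                                               {"r3", "r2"}};
    for (const Rowid &r : temptable_intersect(streams))
      std::cout << r << "\n";                            // r2 and r3, in no particular order
  }

Emitting an ordered rowid stream would correspond to replacing the unordered
map with an ordered container, which is what keeping 'count' inside the temp
table's index achieves in the pseudocode above.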
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, then all
further records cannot have matches, and we can finish the scan, i.e. break
out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By a filesort-like action we mean:
 - Run over the index, collecting rowids in a buffer.
 - When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
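As a rough, in-memory illustration of this strategy (the real code would spill
sorted runs to disk and combine them with mergebuff()), the following C++
sketch sorts each scan's rowids and then intersects the sorted sequences by
repeatedly advancing the streams that are behind; all identifiers and the
sample data are invented.

  // Sketch: sort each rowid stream, then intersect the sorted sequences.
  #include <algorithm>
  #include <iostream>
  #include <string>
  #include <vector>

  using Rowid = std::string;

  // The "filesort-like action", minus the temporary files: one unordered
  // stream of rowids becomes one sorted, duplicate-free sequence.
  std::vector<Rowid> sort_stream(std::vector<Rowid> s)
  {
    std::sort(s.begin(), s.end());
    s.erase(std::unique(s.begin(), s.end()), s.end());
    return s;
  }

  // Intersect N sorted sequences: take the largest of the current heads and
  // advance every other sequence up to it; emit when all heads agree.
  std::vector<Rowid> intersect_sorted(const std::vector<std::vector<Rowid>> &streams)
  {
    std::vector<size_t> pos(streams.size(), 0);
    std::vector<Rowid> out;
    for (;;)
    {
      Rowid high;
      for (size_t i = 0; i < streams.size(); i++)
      {
        if (pos[i] >= streams[i].size())
          return out;                                    // one stream exhausted: done
        high = std::max(high, streams[i][pos[i]]);
      }
      bool all_equal = true;
      for (size_t i = 0; i < streams.size(); i++)
      {
        while (pos[i] < streams[i].size() && streams[i][pos[i]] < high)
          pos[i]++;                                      // catch up to the largest head
        if (pos[i] >= streams[i].size())
          return out;
        all_equal = all_equal && (streams[i][pos[i]] == high);
      }
      if (all_equal)
      {
        out.push_back(high);                             // rowid present in every stream
        for (size_t i = 0; i < streams.size(); i++)
          pos[i]++;
      }
    }
  }

  int main()
  {
    std::vector<std::vector<Rowid>> scans = {{"r9", "r2", "r5"},
                                             {"r5", "r2"},
                                             {"r2", "r5", "r7"}};
    for (auto &s : scans)
      s = sort_stream(s);
    for (const Rowid &r : intersect_sorted(scans))
      std::cout << r << "\n";                            // r2, r5 in rowid order
  }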
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order: one could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding up their n_scans counters.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records to the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
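The following C++ sketch shows the <rowid, n_scans> idea with a std::map
standing in for the in-memory part of Unique (the real class also spills to
disk and merges trees); the class and method names are made up for the
example.

  // Sketch of a Unique-like accumulator for <rowid, n_scans> pairs.
  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  using Rowid = std::string;

  class IntersectUnique
  {
    std::map<Rowid, unsigned> pairs;   // rowid -> n_scans, kept in rowid order
  public:
    // "Don't add a duplicate into the in-memory part, instead combine the two
    // elements": a repeated rowid just bumps its scan counter.
    void put(const Rowid &rowid) { pairs[rowid]++; }

    // get(): walk the pairs in sorted order and discard every record whose
    // n_scans != number_of_scans_being_merged.
    std::vector<Rowid> get(unsigned n_scans_merged) const
    {
      std::vector<Rowid> out;
      for (const auto &p : pairs)
        if (p.second == n_scans_merged)
          out.push_back(p.first);
      return out;
    }
  };

  int main()
  {
    IntersectUnique u;
    for (const Rowid &r : std::vector<Rowid>{"r1", "r2", "r3"}) u.put(r);  // scan 1
    for (const Rowid &r : std::vector<Rowid>{"r2", "r3"})       u.put(r);  // scan 2
    for (const Rowid &r : u.get(2))
      std::cout << r << "\n";                             // r2, r3
  }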
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
  "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  can we consider satisfaction of the AND-parts to be independent? (A small
  numeric sketch of this assumption follows below.)
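To make the question concrete: the usual shortcut is to multiply the
per-condition selectivities, which is only valid if the conditions are
independent. The numbers in this small C++ sketch are invented.

  // What the independence assumption amounts to (invented numbers).
  #include <cstdio>

  int main()
  {
    double table_rows = 1000000.0;
    double sel[] = {0.01, 0.02, 0.05};   // selectivities of cond(key1)..cond(key3)
    double combined = 1.0;
    for (double s : sel)
      combined *= s;                     // valid only if the conditions are independent
    // If the conditions are correlated (e.g. key2 is a function of key1), the
    // real row count can be orders of magnitude larger than this estimate.
    std::printf("estimated rows = %.1f\n", table_rows * combined);   // 10.0
  }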
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 performing the following work for this task.
1. Compared the three possible algorithms for implementing the index intersection operation
mentioned in the HLS by their labor/time consumption. Chose the algorithm that uses a modified
Unique class (1.3) as the cheapest one, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersection as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment, index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation: an index_merge/intersect plan can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ...), and the equalities must cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find a means to compute the intersection of N unordered
rowid streams. Besides the general memory/CPU cost of the computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count n,
  primary key(rowid)
);
Then use this algorithm:
  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx) // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or unordered scan;
  // if we want it to be an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the

  temptable[R].count++

operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, then all
further records cannot have matches, and we can finish the scan, i.e. break
out of the INNER-LOOP.
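A small C++ sketch of this improvement, under the same simplifying assumptions
as before (an in-memory map stands in for the temp table, and a scan never
returns the same rowid twice); the names are invented for illustration.

  // Sketch: stop the inner scan once every temp-table row has matched it.
  #include <string>
  #include <unordered_map>
  #include <vector>

  using Rowid = std::string;

  void scan_with_early_exit(std::unordered_map<Rowid, size_t> &temp_table,
                            const std::vector<Rowid> &scan)
  {
    size_t matches_this_scan = 0;               // number of "count++" operations done
    for (const Rowid &r : scan)
    {
      auto it = temp_table.find(r);
      if (it == temp_table.end())
        continue;
      it->second++;
      if (++matches_this_scan == temp_table.size())
        return;                                 // every temp-table row already matched:
                                                // no further record of this scan can help
    }
  }

  int main()
  {
    std::unordered_map<Rowid, size_t> temp_table = {{"r1", 1}, {"r2", 1}};
    // The scan stops right after "r2": both temp-table rows have matched.
    scan_with_early_exit(temp_table, {"r1", "r2", "r3", "r4"});
    return 0;
  }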
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By a filesort-like action we mean:
 - Run over the index, collecting rowids in a buffer.
 - When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
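As a rough illustration of the priority-queue merge (with in-memory sorted
arrays standing in for the child scans, i.e. for the child_quick->get_next()
calls), here is a C++ sketch; it assumes each input sequence is sorted and
contains each rowid at most once, and all identifiers and sample data are
invented.

  // Sketch: k-way merge of sorted rowid sequences with a priority queue,
  // emitting the rowids that come out of every sequence.
  #include <functional>
  #include <iostream>
  #include <queue>
  #include <string>
  #include <utility>
  #include <vector>

  using Rowid = std::string;

  std::vector<Rowid> pq_intersect(const std::vector<std::vector<Rowid>> &sorted)
  {
    using Head = std::pair<Rowid, size_t>;               // (current rowid, stream number)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> pq;
    std::vector<size_t> pos(sorted.size(), 0);
    for (size_t i = 0; i < sorted.size(); i++)
      if (!sorted[i].empty())
        pq.push({sorted[i][0], i});

    std::vector<Rowid> out;
    Rowid current;
    size_t seen = 0;                                     // how many streams produced 'current'
    while (!pq.empty())
    {
      Head head = pq.top();                              // smallest remaining rowid
      pq.pop();
      if (seen == 0 || head.first != current)
      {
        current = head.first;
        seen = 1;
      }
      else if (++seen == sorted.size())
        out.push_back(current);                          // rowid seen in every stream
      size_t i = head.second;
      if (++pos[i] < sorted[i].size())
        pq.push({sorted[i][pos[i]], i});                 // the stream's next rowid
    }
    return out;
  }

  int main()
  {
    std::vector<std::vector<Rowid>> sorted = {{"r1", "r2", "r5"},
                                              {"r2", "r5"},
                                              {"r2", "r4", "r5"}};
    for (const Rowid &r : pq_intersect(sorted))
      std::cout << r << "\n";                            // r2, r5
  }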
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order: one could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding up their n_scans counters.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records to the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
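A C++ sketch of this "lucky" case, again with an ordinary map standing in for
the real Unique; the structure and names are invented for illustration.

  // Sketch: after some scan S completed without a flush, only counters of
  // already-present rowids are updated; new rowids are never inserted.
  #include <map>
  #include <string>
  #include <vector>

  using Rowid = std::string;

  struct Accumulator
  {
    std::map<Rowid, unsigned> pairs;        // rowid -> n_scans
    bool some_scan_completed = false;       // set once scan S finished without a flush

    void add(const Rowid &rowid)
    {
      if (!some_scan_completed)
      {
        pairs[rowid]++;                     // normal path: insert or combine duplicate
        return;
      }
      auto it = pairs.find(rowid);
      if (it != pairs.end())
        it->second++;                       // counter update only
      // else: the rowid has no match in S, so it cannot be in the intersection
    }
  };

  int main()
  {
    Accumulator acc;
    for (const Rowid &r : std::vector<Rowid>{"r1", "r2"}) acc.add(r);   // scan S, no flush
    acc.some_scan_completed = true;
    for (const Rowid &r : std::vector<Rowid>{"r2", "r9"}) acc.add(r);   // "r9" is ignored
    return static_cast<int>(acc.pairs.count("r9"));                     // 0: nothing new inserted
  }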
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 performing the following work for this task.
1. Compared the three possible algorithms for implementing the index intersection operation
mentioned in the HLS by their labor/time consumption. Chose the algorithm that uses a modified
Unique class (1.3) as the cheapest one, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersection as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It is easy to
adjust the algorithm to make it low: just start scanning all indexes at once and
finish as soon as we get a full match, i.e. as soon as the
temptable[R].count++
operation leaves the counter equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
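To make the above concrete, here is a minimal Python sketch of the temptable counting strategy from 1.1, including the 1.1.1 early exit. It only illustrates the counting logic: the scans are plain Python lists standing in for index range scans, and a dict stands in for the heap/MyISAM temporary table; none of the names below are server code.

  # Sketch of 1.1 with the 1.1.1 early exit (illustration only, not server code).
  def temptable_intersect(scans):
      """scans: list of lists of rowids, one list per merged index scan."""
      scans = sorted(scans, key=len)             # start with the least E(#records)
      counts = {rowid: 1 for rowid in scans[0]}  # the "temp table": rowid -> count
      for idx_scan in scans[1:]:
          hits = 0                               # 1.1.1: number of count++ done in this scan
          for rowid in idx_scan:                 # INNER-LOOP
              if rowid in counts:
                  counts[rowid] += 1
                  hits += 1
                  if hits == len(counts):        # every temptable row already matched
                      break                      # no further record in this scan can match
      # final pass over the "temp table"; sorting by rowid gives an ordered stream
      return sorted(r for r, c in counts.items() if c == len(scans))

  # three unordered rowid streams; the intersection is {1, 7, 9}
  print(temptable_intersect([[7, 3, 9, 1], [1, 9, 5, 7, 2], [9, 4, 1, 8, 7]]))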
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort) to produce one
ordered sequence (i.e. an array, possibly partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
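A minimal Python sketch of this strategy, under the assumption that each rowid occurs at most once per scan: sort every stream (the filesort-like action), merge the sorted streams, and keep the rowids that appear once per stream. heapq.merge stands in for the priority-queue/mergebuff() machinery; this is an illustration, not the QUICK_ROR_INTERSECT_SELECT code.

  # Sketch of 1.2 (illustration only): sort each stream, then k-way merge.
  import heapq
  from itertools import groupby

  def sort_merge_intersect(scans):
      sorted_streams = [sorted(s) for s in scans]   # one "filesort" per scan
      merged = heapq.merge(*sorted_streams)         # ordered stream with duplicates
      n = len(scans)
      # a rowid is in the intersection iff it came from every stream
      return [rowid for rowid, grp in groupby(merged)
              if sum(1 for _ in grp) == n]

  print(sort_merge_intersect([[7, 3, 9, 1], [1, 9, 5, 7, 2], [9, 4, 1, 8, 7]]))  # [1, 7, 9]

Note that the output comes out rowid-ordered, which is what allows it to be piped into the existing intersect code.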
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates rowids of
a single scan and then produces them in sorted order; one could do the same faster
by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead combine the two
    elements by adding their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered; all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we only need to update the counters to be able to tell whether the elements that
are already in the Unique will have matches in all scans (see the sketch below).
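The following toy Python class illustrates the intended behaviour of such an extended Unique: it stores <rowid, n_scans> pairs, combines duplicates as they are added rather than keeping two entries for the same rowid, and filters on get(). The real class would also flush to disk and merge the flushed trees; that part is omitted here, and the class/method names are invented for the illustration.

  # Toy stand-in for the extended Unique of 1.3 (in-memory only, not server code).
  class IntersectUnique:
      def __init__(self, n_scans_to_merge):
          self.n_scans_to_merge = n_scans_to_merge
          self.pairs = {}                        # rowid -> n_scans seen so far

      def add(self, rowid):
          # combining duplicates on insert == adding their n_scans values
          self.pairs[rowid] = self.pairs.get(rowid, 0) + 1

      def get(self):
          # emit in rowid order, discarding rowids not seen in every merged scan
          return [rowid for rowid in sorted(self.pairs)
                  if self.pairs[rowid] == self.n_scans_to_merge]

  scans = [[7, 3, 9, 1], [1, 9, 5, 7, 2], [9, 4, 1, 8, 7]]
  u = IntersectUnique(len(scans))
  for scan in scans:
      for rowid in scan:
          u.add(rowid)
  print(u.get())                                 # [1, 7, 9]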
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep these here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "lossy" when they exceed a certain size by remembering
disk pages rather than ids of individual records. It is hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of the AND-parts to be independent? (See the sketch below.)
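To spell out the independence question, here is a small Python sketch of the row estimate one would get if the AND-parts were treated as independent: the expected intersection size is table_rows times the product of the per-scan selectivities. All numbers below are invented for illustration only.

  # Row estimate under the independence assumption (numbers are made up).
  from math import prod

  def estimate_intersection_rows(table_rows, scan_rows):
      """scan_rows: estimated #records returned by each merged range scan."""
      selectivities = [r / table_rows for r in scan_rows]
      return table_rows * prod(selectivities)

  table_rows = 1_000_000
  scan_rows = [50_000, 20_000, 10_000]        # E(#records) of each merged scan
  print(estimate_intersection_rows(table_rows, scan_rows))   # 10.0 expected rows
  # If the conditions are correlated (e.g. key2 is computed from key1), the true
  # intersection can be as large as min(scan_rows) = 10000 rows, so assuming
  # independence can underestimate the result size badly.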
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 on the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned
in the HLS by their labor/time cost. Chose the algorithm that uses a modified Unique class (1.3)
as the cheapest, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates rowids of
a single scan and then produces them in sorted order; one could do the same faster
by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep these here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 on the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned
in the HLS by their labor/time cost. Chose the algorithm that uses a modified Unique class (1.3)
as the cheapest, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low - we'll need to just start scanning all
indexes at once, and finish as soon as we got a full match, i.e. the
temptable[R].count++
operation resulted in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use mergebuff() function (it is part of filesort's functions) to produce one
ordered sequence (i.e. array, which may be partially on disk) of rowids.
Merging of ordered streams with help of priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates rowids of
a single scan and then produces them in sorted order; one could do the same faster
by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we're lucky to have started and finished a scan on some index (denote it
as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique because their
absence in the Unique means that they don't have match in S and hence will
not get into the result of intersection.
- we need to only update the counters to be able to tell if the elements that
are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep these here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
disk pages, not ids of individual records. It's hard for us to do something
similar because our rowids are opaque entities whose meaning depends on the
storage engines.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 on the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned
in the HLS by their labor/time cost. Chose the algorithm that uses a modified Unique class (1.3)
as the cheapest, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities must cover all index components.
For example, assume that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key1_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
  into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
  are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
  rowid binary($rowid_size),
  count int,
  primary key(rowid)
);
Then use this algorithm:
  i1= {index with the least E(#records)};

  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);

  for each index idx except i1
  {
    for each record R in scan(idx)   // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }

  // The following loop can do an ordered or unordered scan;
  // if we want an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }

The algorithm can optionally emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It is easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the

  temptable[R].count++

operation makes the counter equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running the INNER-LOOP, we could count how many times we have done the
"count++" operation. If it has been done #records-in-temptable times, then all
further records cannot have matches and we can finish the scan, i.e. break out
of the INNER-LOOP.
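As an illustration, here is a minimal standalone C++ sketch of this strategy,
including the early-exit improvement above. It is not the server code: it
assumes the rowid streams fit in memory, models the heap/MyISAM temporary
table with a std::unordered_map, and assumes each rowid appears at most once
per scan; the function name temptable_intersect is ours.

  #include <cstddef>
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  using Rowid = uint64_t;

  // Intersect N rowid streams by counting matches in a temporary table.
  std::vector<Rowid>
  temptable_intersect(const std::vector<std::vector<Rowid>> &scans)
  {
    if (scans.empty())
      return {};

    // i1: the scan expected to return the fewest rows (here: the shortest).
    size_t i1= 0;
    for (size_t i= 1; i < scans.size(); i++)
      if (scans[i].size() < scans[i1].size())
        i1= i;

    std::unordered_map<Rowid, size_t> temp_table;   // rowid -> match count
    for (Rowid r : scans[i1])
      temp_table.emplace(r, 1);

    for (size_t idx= 0; idx < scans.size(); idx++)
    {
      if (idx == i1)
        continue;
      size_t matched= 0;                 // 1.1.1: number of "count++" operations
      for (Rowid r : scans[idx])         // INNER-LOOP
      {
        auto it= temp_table.find(r);
        if (it != temp_table.end())
        {
          it->second++;
          if (++matched == temp_table.size())
            break;                       // every temp-table row already matched
        }
      }
    }

    std::vector<Rowid> result;
    for (const auto &p : temp_table)
      if (p.second == scans.size())      // matched in every merged stream
        result.push_back(p.first);
    return result;
  }

Sorting 'result' before returning it would correspond to the ordered-output
variant mentioned above.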
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean:
 - Run over the index and collect rowids in a buffer.
 - When the buffer is full, sort it and dump it into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (part of filesort) to produce one ordered
sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already
implemented in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the

  child_quick->get_next()

call with a call that reads a rowid from an ordered sequence.
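A sketch of this strategy under simplifying assumptions: each scan's rowids
are collected into an in-memory vector, the filesort/mergebuff() step is
modeled by a plain std::sort, and the merge keeps one read cursor per sorted
stream; the function name sort_merge_intersect is ours, and rowids are assumed
to be unique within each scan.

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  using Rowid = uint64_t;

  std::vector<Rowid>
  sort_merge_intersect(std::vector<std::vector<Rowid>> scans)
  {
    if (scans.empty())
      return {};

    for (auto &s : scans)                     // the "filesort-like action"
      std::sort(s.begin(), s.end());

    std::vector<size_t> pos(scans.size(), 0); // read cursor per sorted stream
    std::vector<Rowid> result;

    for (;;)
    {
      // Candidate: the largest rowid currently under any cursor.
      Rowid candidate= 0;
      for (size_t i= 0; i < scans.size(); i++)
      {
        if (pos[i] == scans[i].size())
          return result;                      // a stream is exhausted: done
        candidate= std::max(candidate, scans[i][pos[i]]);
      }

      // Advance every cursor to the first rowid >= candidate.
      bool all_equal= true;
      for (size_t i= 0; i < scans.size(); i++)
      {
        while (pos[i] < scans[i].size() && scans[i][pos[i]] < candidate)
          pos[i]++;
        if (pos[i] == scans[i].size())
          return result;
        if (scans[i][pos[i]] != candidate)
          all_equal= false;
      }

      if (all_equal)
      {
        result.push_back(candidate);          // present in every stream
        for (auto &p : pos)
          p++;
      }
    }
  }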
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates the
rowids of a single scan and then produces them in sorted order: one could do
the same faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all
scans. The idea is as follows:
- Unique should store <rowid, n_scans> pairs.
- Duplicates are pairs with the same rowid.
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call.
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records to the Unique, because their
  absence from the Unique means that they have no match in S and hence will
  not get into the result of the intersection;
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique have matches in all scans.
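To make the combining/filtering rules concrete, here is a sketch of the
Unique.get()-style merge step under simplifying assumptions: every flushed run
is modeled as a std::vector of <rowid, n_scans> pairs sorted by rowid, runs
are merged with a priority queue, duplicates are combined by summing their
n_scans values, and only rowids whose combined count equals the number of
merged scans are kept. The name merge_runs is ours, not the actual Unique API.

  #include <cstddef>
  #include <cstdint>
  #include <queue>
  #include <utility>
  #include <vector>

  using Rowid = uint64_t;
  using Entry = std::pair<Rowid, size_t>;        // <rowid, n_scans>

  std::vector<Rowid>
  merge_runs(const std::vector<std::vector<Entry>> &runs, size_t n_scans_merged)
  {
    // One cursor per run, ordered by the rowid it currently points at.
    struct Cursor { Rowid rowid; size_t run, pos; };
    auto cmp= [](const Cursor &a, const Cursor &b) { return a.rowid > b.rowid; };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(cmp)> pq(cmp);

    for (size_t r= 0; r < runs.size(); r++)
      if (!runs[r].empty())
        pq.push(Cursor{runs[r][0].first, r, 0});

    std::vector<Rowid> result;
    while (!pq.empty())
    {
      Rowid current= pq.top().rowid;
      size_t total= 0;                           // combined n_scans for 'current'
      while (!pq.empty() && pq.top().rowid == current)
      {
        Cursor c= pq.top();
        pq.pop();
        total+= runs[c.run][c.pos].second;       // combine duplicates
        if (c.pos + 1 < runs[c.run].size())
          pq.push(Cursor{runs[c.run][c.pos + 1].first, c.run, c.pos + 1});
      }
      if (total == n_scans_merged)               // filter: must match every scan
        result.push_back(current);
    }
    return result;
  }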
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass it to the output); otherwise it doesn't (skip it).
This doesn't have any advantages over the produce/merge sorted streams
approach.
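For completeness, a sketch of this (rejected) strategy under the assumption
that all rowids fit in memory; sort_count_intersect is our name for it.

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  using Rowid = uint64_t;

  std::vector<Rowid>
  sort_count_intersect(const std::vector<std::vector<Rowid>> &scans)
  {
    std::vector<Rowid> all;
    for (const auto &s : scans)            // dump everything into one stream
      all.insert(all.end(), s.begin(), s.end());
    std::sort(all.begin(), all.end());     // get it sorted

    std::vector<Rowid> result;
    size_t i= 0;
    while (i < all.size())
    {
      size_t j= i;
      while (j < all.size() && all[j] == all[i])
        j++;
      if (j - i == scans.size())           // repeated n_merged_scans times
        result.push_back(all[i]);
      i= j;
    }
    return result;
  }

It does essentially the same work as sorting each stream separately and
merging, which is why it offers no advantage over strategy 1.2.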
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem is that Falcon's bitmaps assume
there will always be enough memory to accommodate them.
PostgreSQL makes its bitmaps "lossy" when they exceed a certain size by
remembering disk pages rather than ids of individual records. It's hard for us
to do something similar because our rowids are opaque entities whose meaning
depends on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
  = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
    a subset of columns covered by all other indexes.
  = (TODO any other rules?)
- Correlation across selectivities. If there is a condition

    "cond(key1) AND cond(key2) AND ... AND cond(keyN)",

  can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Igor): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
by worklog-noreply@askmonty.org 24 Jun '10
24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Rhuddleston, Sanja, Knielsen, Serg, Monty, Timour, Igor, Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Igor - Thu, 24 Jun 2010, 05:48)=-=-
Observers changed: Knielsen,Monty,Psergey,Sanja,Igor,Rhuddleston,Timour,Serg
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in the month June 2010 to perform the following work for this task.
1. Compared tree possible algorithms to implement the operation of index intersection mentioned in
HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3) as
the most cheap requiring the least amount of efforts/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
  i1= {index with the least E(#records)};
  for each record R in range_scan(i1)
    temp_table.insert(R.rowid, count=1);
  for each index idx except i1
  {
    for each record R in scan(idx) // (INNER-LOOP)
    {
      if (temp_table has R)
        temptable[R].count++;
    }
  }
  // The following loop can do an ordered or unordered scan;
  // if we want it to be an ordered scan, we should probably arrange for the
  // 'count' column to be part of the index.
  for each record R in temp_table
  {
    if (R.count == number_of_streams)
      emit(R.rowid);
  }
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the
  temptable[R].count++
operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
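As an illustration of 1.1 and 1.1.1, here is a minimal standalone C++ sketch
(not server code). It only models the idea: std::unordered_map stands in for
the heap-grow-out-to-myisam temptable, Rowid is a placeholder for the engine's
opaque rowid, and the scans are assumed to be pre-materialized vectors in
which each rowid appears at most once.

#include <cstdint>
#include <unordered_map>
#include <vector>

using Rowid = uint64_t;   // placeholder for an opaque, engine-specific rowid

// scans[0] must be the scan with the least E(#records).
std::vector<Rowid>
temptable_intersect(const std::vector<std::vector<Rowid>> &scans)
{
  std::unordered_map<Rowid, unsigned> temp;   // the "temptable": rowid -> count
  for (Rowid r : scans[0])
    temp.emplace(r, 1);

  for (size_t i= 1; i < scans.size(); i++)
  {
    size_t hits= 0;
    for (Rowid r : scans[i])                  // INNER-LOOP
    {
      auto it= temp.find(r);
      if (it != temp.end())
      {
        it->second++;
        if (++hits == temp.size())            // 1.1.1: every temptable row has
          break;                              // already matched; stop this scan
      }
    }
  }

  std::vector<Rowid> result;                  // unordered; sort it if an
  for (const auto &p : temp)                  // ordered rowid stream is needed
    if (p.second == scans.size())
      result.push_back(p.first);
  return result;
}

The low-startup-cost variant mentioned above would correspond to interleaving
the scans instead of running them one after another.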
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (it is part of filesort's code) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
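A standalone sketch of 1.2 follows, under the assumption that the per-scan
rowid buffers fit in memory, so std::sort plays the role of the filesort-like
step and a priority queue plays the role of the QUICK_ROR_INTERSECT_SELECT
style merge (reading from sorted arrays instead of calling
child_quick->get_next()). merge_sorted_intersect is an illustrative name, not
existing code; each rowid is assumed to occur at most once per stream.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Rowid = uint64_t;

std::vector<Rowid>
merge_sorted_intersect(std::vector<std::vector<Rowid>> streams)
{
  for (auto &s : streams)
    std::sort(s.begin(), s.end());            // the "filesort-like action"

  typedef std::pair<Rowid, size_t> Entry;     // (rowid, stream number)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
  std::vector<size_t> pos(streams.size(), 0);
  for (size_t i= 0; i < streams.size(); i++)
    if (!streams[i].empty())
      pq.push(Entry(streams[i][0], i));

  std::vector<Rowid> result;
  Rowid cur= 0;
  size_t seen= 0;
  while (!pq.empty())
  {
    Entry e= pq.top(); pq.pop();
    if (seen == 0 || e.first != cur) { cur= e.first; seen= 0; }
    if (++seen == streams.size())
      result.push_back(cur);                  // rowid present in every stream
    if (++pos[e.second] < streams[e.second].size())
      pq.push(Entry(streams[e.second][pos[e.second]], e.second));
  }
  return result;                              // already rowid-ordered
}

With real filesort buffers the only change is that the sorted runs live on
disk and are read back through buffered merging; the counting logic stays the
same.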
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates rowids
of a single scan and then produces them in sorted order. One could do the same
faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
  - don't add a duplicate into the in-memory part; instead, combine the two
    elements by adding their n_scans values.
  - combine duplicates when it sees them in the Unique.get() call
- The data we get from Unique.get() should be filtered: all records that have
  n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence in the Unique means that they don't have a match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
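The counting/filtering part of the modified Unique can be sketched as follows.
std::map is only a stand-in for Unique's real tree plus disk-merge machinery,
and UniqueIntersect is an illustrative class, not the proposed server change;
the point is the <rowid, n_scans> bookkeeping described above.

#include <cstdint>
#include <map>
#include <vector>

using Rowid = uint64_t;

class UniqueIntersect
{
  std::map<Rowid, unsigned> pairs;            // <rowid, n_scans>
public:
  // Called once per rowid per scan; duplicates are combined on the spot
  // instead of being stored twice.
  void add(Rowid r) { pairs[r]++; }

  // The equivalent of filtering Unique.get(): keep only rowids whose counter
  // equals the number of scans being merged; output comes out rowid-ordered.
  std::vector<Rowid> get(unsigned n_scans_merged) const
  {
    std::vector<Rowid> out;
    for (const auto &p : pairs)
      if (p.second == n_scans_merged)
        out.push_back(p.first);
    return out;
  }
};

The "lucky" case above corresponds to freezing the set of keys after the fully
processed scan S: add() would then only increment counters for rowids that are
already present and silently drop everything else.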
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages instead of ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
  "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
  can we consider satisfaction of the AND-parts to be independent? (A sketch
  of what the independence assumption means for the row estimate follows this
  list.)
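The sketch referred to above only illustrates the independence assumption
(the function name is made up, not an existing optimizer entry point):
treating the AND-parts as independent means multiplying their selectivities.

#include <vector>

// rows(cond1 AND ... AND condN) ~= table_rows * sel(cond1) * ... * sel(condN)
// under the independence assumption; positively correlated conditions make
// this estimate too low, negatively correlated ones make it too high.
double estimate_intersect_rows(double table_rows,
                               const std::vector<double> &selectivities)
{
  double rows= table_rows;
  for (size_t i= 0; i < selectivities.size(); i++)
    rows*= selectivities[i];
  return rows;
}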
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Guest): index_merge: non-ROR intersection (21)
by worklog-noreply@askmonty.org 24 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: index_merge: non-ROR intersection
CREATION DATE..: Thu, 21 May 2009, 21:32
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......: Psergey
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 21 (http://askmonty.org/worklog/?tid=21)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 25
ESTIMATE.......: 175 (hours remain)
ORIG. ESTIMATE.: 175
PROGRESS NOTES:
-=-=(Guest - Thu, 24 Jun 2010, 05:44)=-=-
I spent 25 hours in June 2010 on the following work for this task.
1. Compared three possible algorithms for implementing the index intersection operation mentioned in
the HLS by their labor/time consumption. Chose the algorithm that uses a modified Unique class (1.3)
as the cheapest one, requiring the least amount of effort/time for its development.
2. Developed a design for a modification of the Unique class to support the operation of index
intersection.
3. Modified the merge_buffers procedure used by the Unique class to make it possible to use it not
only for the operation of union, but for the operation of intersect as well.
Worked 25 hours and estimate 175 hours remain (original estimate increased by 200 hours).
-=-=(Guest - Mon, 20 Jul 2009, 17:13)=-=-
Dependency deleted: 30 no longer depends on 21
-=-=(Psergey - Wed, 03 Jun 2009, 12:09)=-=-
Dependency created: 30 now depends on 21
-=-=(Guest - Wed, 03 Jun 2009, 01:17)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.30002 2009-06-03 01:17:32.000000000 +0300
+++ /tmp/wklog.21.new.30002 2009-06-03 01:17:32.000000000 +0300
@@ -7,13 +7,13 @@
The current optimization works with:
-WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
-WHERE key1_part1=1 OR key2_part1=3
+WHERE key1_part1=1 AND key2_part1=3
or
-WHERE key_part1<10 or key2_part1<100
+WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:06)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29694 2009-06-03 01:06:50.000000000 +0300
+++ /tmp/wklog.21.new.29694 2009-06-03 01:06:50.000000000 +0300
@@ -12,6 +12,8 @@
but not with:
WHERE key1_part1=1 OR key2_part1=3
+or
+WHERE key_part1<10 or key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Monty - Wed, 03 Jun 2009, 01:05)=-=-
High Level Description modified.
--- /tmp/wklog.21.old.29638 2009-06-03 01:05:01.000000000 +0300
+++ /tmp/wklog.21.new.29638 2009-06-03 01:05:01.000000000 +0300
@@ -3,5 +3,15 @@
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
+For example, assuming that key1 has 2 parts and key2 has 1 part.
+
+The current optimization works with:
+
+WHERE key1_part1=1 AND key1_part2=2 OR key2_part1=3
+
+but not with:
+
+WHERE key1_part1=1 OR key2_part1=3
+
This WL entry is to lift this limitation by developing algorithms that do
-intersection on non-ROR scans.
+intersection on non-ROR (rowid ordered retrieval) scans.
-=-=(Guest - Tue, 26 May 2009, 14:04)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.1802 2009-05-26 14:04:57.000000000 +0300
+++ /tmp/wklog.21.new.1802 2009-05-26 14:04:57.000000000 +0300
@@ -1,4 +1,3 @@
-
<contents>
1. Execution
1.1 Temptable
@@ -30,6 +29,8 @@
1.1 Temptable
-------------
+[ This is our strategy of choice at the moment]
+
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
@@ -168,3 +169,8 @@
a subset of columns covered by all other indexes.
= (TODO any other rules?)
+- Correlation across selectivities. If there is a condition
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ can we consider satisfaction of AND-parts to be independent?
-=-=(Psergey - Thu, 21 May 2009, 21:33)=-=-
High-Level Specification modified.
--- /tmp/wklog.21.old.25705 2009-05-21 21:33:02.000000000 +0300
+++ /tmp/wklog.21.new.25705 2009-05-21 21:33:02.000000000 +0300
@@ -1 +1,170 @@
+<contents>
+1. Execution
+1.1 Temptable
+1.1.1 Improvement
+1.2 Produce/merge sorted streams
+1.3 Extend Unique class to handle intersection
+1.4 Strategies that do not seem to be useful
+1.4.1 Remove matches after having produced an ordered stream
+1.4.2 Sparse rowid bitmaps
+2. Optimization
+
+</contents>
+
+1. Execution
+============
+
+The primary task is to find means to compute an intersection of N unordered
+streams. Besides general memory/cpu cost of computation, we consider:
+
+- whether the produced rowid stream is ordered. If it is, it can be piped
+ into index_merge/intersect (as opposed to sort-intersect)
+
+- whether the strategy can take advantage of the fact that some input streams
+ are already rowid-ordered
+
+- startup cost (cost of producing the first output record)
+
+We see the following possible strategies:
+
+1.1 Temptable
+-------------
+Use a temporary heap-grow-out-to-myisam table with a primary key:
+
+create table temp_table (
+ rowid binary($rowid_size),
+ count n,
+ primary key(rowid);
+);
+
+Then use this algorithm:
+
+ i1= {index with the least E(#records)};
+
+ for each record R in range_scan(i1)
+ temp_table.insert(R.rowid, count=1);
+
+ for each index idx except i1
+ {
+ for each R record in scan(idx) // (INNER-LOOP)
+ {
+ if (temp_table has R)
+ temptable[R].count++;
+ }
+ }
+
+ // The following loop can do ordered or unordered scan
+ // if we want it to be ordered scan, we probably better arrange so that
+ // 'count' column is part of the index.
+ for each record R in temp_table
+ {
+ if (R.count == number_of_streams)
+ emit(R.rowid);
+ }
+
+The algorithm has an option to emit an ordered rowid stream.
+
+In the above form, the cost to produce the first record is high. It's easy to
+adjust the algorithm to make it low - we'll need to just start scanning all
+indexes at once, and finish as soon as we got a full match, i.e. the
+
+ temptable[R].count++
+
+operation resulted in the counter being equal to the number of merged scans.
+
+1.1.1 Improvement
+~~~~~~~~~~~~~~~~~
+When running INNER-LOOP, we could count how many times we've done the
+"count++" operation. If it has been done #records-in-temptable times, that
+means that all further records will not have matches and we can finish the
+scan, i.e. break out of the INNER-LOOP.
+
+1.2 Produce/merge sorted streams
+--------------------------------
+For each of the merged scan, use filesort-like action to end up with an
+ordered stream of rowids. Then merge the ordered streams.
+
+By filesort-like action we mean
+ - Run over index, collect rowids in a buffer.
+ - When the buffer is full, sort it and dump into a temporary file.
+After the above we'll end up with a number of sorted buffers on disk. We can
+use mergebuff() function (it is part of filesort's functions) to produce one
+ordered sequence (i.e. array, which may be partially on disk) of rowids.
+
+Merging of ordered streams with help of priority queue is already implemented
+in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
+
+ child_quick->get_next()
+
+call with a call to read rowid from an ordered sequence.
+
+1.3 Extend Unique class to handle intersection
+----------------------------------------------
+There is no point to use Unique object as a device that accumulates rowids of
+a single scan then produces them in sorted order. One could do the same faster
+with accumulating an array of rowids and then sorting it.
+
+It's possible to use Unique object to collect/merge data from all scans though.
+The idea is as follows:
+
+- Unique should store <rowid, n_scans> pairs
+- Duplicates are pairs with the same rowid
+- Unique should try to avoid creating duplicates:
+ - don't add a duplicate into the in-memory part, instead combine two elements
+ together by adding their n_scans elements.
+ - combine duplicates when it sees them in Unique.get() call
+- The data we get from Unique.get() should be filtered, all records that have
+ n_scans != number_of_scans_being_merged should be discarded.
+
+If we're lucky to have started and finished a scan on some index (denote it
+as S) without flushing the Unique in the process, then:
+- there is no point in adding any new records into the Unique because their
+ absence in the Unique means that they don't have match in S and hence will
+ not get into the result of intersection.
+- we need to only update the counters to be able to tell if the elements that
+ are already in the Unique will have matches in all scans.
+
+1.4 Strategies that do not seem to be useful
+--------------------------------------------
+
+keeping them here so we don't consider them over and over
+
+1.4.1 Remove matches after having produced an ordered stream
+------------------------------------------------------------
+We can dump everything into a rowid stream and get it sorted. Then we read it,
+and if we see a rowid repeated $n_merged_scans times, it belongs to the
+intersection (pass to output), otherwise it doesn't (skip).
+This doesn't have any advantages over the produce/merge sorted streams
+approach.
+
+1.4.2 Sparse rowid bitmaps
+--------------------------
+Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
+bitmaps assume there will always be enough memory to accommodate them.
+
+PostgreSQL makes bitmaps "loose" when they exceed certain size by remembering
+disk pages, not ids of individual records. It's hard for us to do something
+similar because our rowids are opaque entities whose meaning depends on the
+storage engines.
+
+This seems to require too much change to be worth it.
+
+2. Optimization
+===============
+
+SEL_TREE objects already represent intersections. The problems with
+optimizations are:
+
+- Cost formula(s)
+- When N keys/conditions are present:
+
+ "cond(key1) AND cond(key2) AND ... AND cond(keyN)",
+
+ somehow avoid considering (2^n - n) possible options.
+
+- Avoid producing (or even considering) apparently suboptimal plans:
+ = Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
+ a subset of columns covered by all other indexes.
+ = (TODO any other rules?)
+
DESCRIPTION:
At the moment index_merge supports intersection only for rowid-ordered streams.
This translates into a limitation that index_merge/intersect can only be
constructed for equality conditions (t.keypart1=const1 AND t.keypart2=const2
AND ... ) and the equalities should cover all index components.
For example, assuming that key1 has 2 parts and key2 has 1 part.
The current optimization works with:
WHERE key1_part1=1 AND key1_part2=2 AND key2_part1=3
but not with:
WHERE key1_part1=1 AND key2_part1=3
or
WHERE key_part1<10 AND key2_part1<100
This WL entry is to lift this limitation by developing algorithms that do
intersection on non-ROR (rowid ordered retrieval) scans.
HIGH-LEVEL SPECIFICATION:
<contents>
1. Execution
1.1 Temptable
1.1.1 Improvement
1.2 Produce/merge sorted streams
1.3 Extend Unique class to handle intersection
1.4 Strategies that do not seem to be useful
1.4.1 Remove matches after having produced an ordered stream
1.4.2 Sparse rowid bitmaps
2. Optimization
</contents>
1. Execution
============
The primary task is to find means to compute an intersection of N unordered
streams. Besides general memory/cpu cost of computation, we consider:
- whether the produced rowid stream is ordered. If it is, it can be piped
into index_merge/intersect (as opposed to sort-intersect)
- whether the strategy can take advantage of the fact that some input streams
are already rowid-ordered
- startup cost (cost of producing the first output record)
We see the following possible strategies:
1.1 Temptable
-------------
[ This is our strategy of choice at the moment]
Use a temporary heap-grow-out-to-myisam table with a primary key:
create table temp_table (
rowid binary($rowid_size),
count n,
primary key(rowid);
);
Then use this algorithm:
i1= {index with the least E(#records)};
for each record R in range_scan(i1)
temp_table.insert(R.rowid, count=1);
for each index idx except i1
{
for each R record in scan(idx) // (INNER-LOOP)
{
if (temp_table has R)
temptable[R].count++;
}
}
// The following loop can do ordered or unordered scan
// if we want it to be ordered scan, we probably better arrange so that
// 'count' column is part of the index.
for each record R in temp_table
{
if (R.count == number_of_streams)
emit(R.rowid);
}
The algorithm has an option to emit an ordered rowid stream.
In the above form, the cost to produce the first record is high. It's easy to
adjust the algorithm to make it low: we just need to start scanning all
indexes at once and finish as soon as we get a full match, i.e. as soon as the
  temptable[R].count++
operation results in the counter being equal to the number of merged scans.
1.1.1 Improvement
~~~~~~~~~~~~~~~~~
When running INNER-LOOP, we could count how many times we've done the
"count++" operation. If it has been done #records-in-temptable times, that
means that all further records will not have matches and we can finish the
scan, i.e. break out of the INNER-LOOP.
1.2 Produce/merge sorted streams
--------------------------------
For each of the merged scans, use a filesort-like action to end up with an
ordered stream of rowids. Then merge the ordered streams.
By filesort-like action we mean
- Run over index, collect rowids in a buffer.
- When the buffer is full, sort it and dump into a temporary file.
After the above we'll end up with a number of sorted buffers on disk. We can
use the mergebuff() function (it is part of filesort's code) to produce one
ordered sequence (i.e. an array, which may be partially on disk) of rowids.
Merging of ordered streams with the help of a priority queue is already implemented
in QUICK_ROR_INTERSECT_SELECT. We'll need to substitute the
child_quick->get_next()
call with a call to read rowid from an ordered sequence.
1.3 Extend Unique class to handle intersection
----------------------------------------------
There is no point in using a Unique object as a device that accumulates rowids
of a single scan and then produces them in sorted order. One could do the same
faster by accumulating an array of rowids and then sorting it.
It is possible, though, to use a Unique object to collect/merge data from all scans.
The idea is as follows:
- Unique should store <rowid, n_scans> pairs
- Duplicates are pairs with the same rowid
- Unique should try to avoid creating duplicates:
- don't add a duplicate into the in-memory part, instead combine two elements
together by adding their n_scans elements.
- combine duplicates when it sees them in Unique.get() call
- The data we get from Unique.get() should be filtered, all records that have
n_scans != number_of_scans_being_merged should be discarded.
If we are lucky enough to have started and finished a scan on some index
(denote it as S) without flushing the Unique in the process, then:
- there is no point in adding any new records into the Unique, because their
  absence in the Unique means that they don't have a match in S and hence will
  not get into the result of the intersection.
- we only need to update the counters to be able to tell whether the elements
  that are already in the Unique will have matches in all scans.
1.4 Strategies that do not seem to be useful
--------------------------------------------
We keep them here so that we don't consider them over and over.
1.4.1 Remove matches after having produced an ordered stream
------------------------------------------------------------
We can dump everything into a rowid stream and get it sorted. Then we read it,
and if we see a rowid repeated $n_merged_scans times, it belongs to the
intersection (pass to output), otherwise it doesn't (skip).
This doesn't have any advantages over the produce/merge sorted streams
approach.
1.4.2 Sparse rowid bitmaps
--------------------------
Use Falcon-style rowid bitmaps. The problem with that is that Falcon's
bitmaps assume there will always be enough memory to accommodate them.
PostgreSQL makes bitmaps "loose" when they exceed a certain size by remembering
disk pages instead of ids of individual records. It's hard for us to do
something similar because our rowids are opaque entities whose meaning depends
on the storage engine.
This seems to require too much change to be worth it.
2. Optimization
===============
SEL_TREE objects already represent intersections. The problems with
optimizations are:
- Cost formula(s)
- When N keys/conditions are present:
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
somehow avoid considering (2^n - n) possible options.
- Avoid producing (or even considering) apparently suboptimal plans:
= Don't generate a merge of indexes (I_1, ... I_n) where columns of I_n are
a subset of columns covered by all other indexes.
= (TODO any other rules?)
- Correlation across selectivities. If there is a condition
"cond(key1) AND cond(key2) AND ... AND cond(keyN)",
can we consider satisfaction of AND-parts to be independent?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Hi!
I have a problem with the automatic commit mails, so I am sending the diff
here (sorry, I will fix it tomorrow).
I have also thought about renaming sql/sql_expression_cache.* to
sql/item_expression_cache and moving the item there as well, but I am not
sure it is better.
I am also not sure that Item_cache_wrapper is the best name, but
Item_expression_cache_wrapper is IMHO too long.
I re-made 5.3-mwl-66, so it needs re-branching (not pulling) if you want
to look at it.
[Maria-developers] Please review: MWL#121: DS-MRR support for clustered primary keys
by Sergey Petrunya 22 Jun '10
Hello Igor,
Please find below the combined patch for MWL#121. It is ready for review.
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result 2010-06-22 23:28:02.000000000 +0400
@@ -0,0 +1,148 @@
+drop table if exists t0,t1,t2,t3;
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+Table Create Table
+t1 CREATE TABLE `t1` (
+ `a` char(8) NOT NULL DEFAULT '',
+ `b` char(8) DEFAULT NULL,
+ `filler` char(100) DEFAULT NULL,
+ PRIMARY KEY (`a`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 8 test.t2.a 1 Using join buffer
+This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+a b filler a
+a-1010=A b-1010=B filler a-1010=A
+a-1020=A b-1020=B filler a-1020=A
+a-1030=A b-1030=B filler a-1030=A
+drop table t1, t2;
+create table t1(
+a char(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+insert into t2 values ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 5
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1(
+a varchar(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 30 test.t2.a,test.t2.b 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 26 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 8 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+11 22 1234 filler 11 22
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+11 22 1234 filler 11 22
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+set join_cache_level=6;
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 4 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+a b c filler a b
+set optimizer_switch='index_condition_pushdown=off';
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 4 test.t2.a 1 Using where; Using join buffer
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+a b c filler a b
+set optimizer_switch='index_condition_pushdown=on';
+drop table t1,t2;
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result.moved maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result.moved
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/r/innodb_mrr_cpk.result.moved 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/r/innodb_mrr_cpk.result.moved 2010-06-22 19:23:18.000000000 +0400
@@ -0,0 +1,122 @@
+drop table if exists t0,t1,t2,t3;
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+Table Create Table
+t1 CREATE TABLE `t1` (
+ `a` char(8) NOT NULL DEFAULT '',
+ `b` char(8) DEFAULT NULL,
+ `filler` char(100) DEFAULT NULL,
+ PRIMARY KEY (`a`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 8 test.t2.a 1 Using join buffer
+This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+a b filler a
+a-1010=A b-1010=B filler a-1010=A
+a-1020=A b-1020=B filler a-1020=A
+a-1030=A b-1030=B filler a-1030=A
+drop table t1, t2;
+create table t1(
+a char(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1(
+a varchar(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 30 test.t2.a,test.t2.b 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 26 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 8 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+11 22 1234 filler 11 22
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+11 22 1234 filler 11 22
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+set join_cache_level=6;
+drop table t1,t2;
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test 2010-06-22 23:28:02.000000000 +0400
@@ -0,0 +1,137 @@
+#
+# Tests for DS-MRR over clustered primary key. The only engine that supports
+# this is InnoDB/XtraDB.
+#
+# Basic idea about testing
+# - DS-MRR/CPK works only with BKA
+# - Should also test index condition pushdown
+# - Should also test whatever uses RANGE_SEQ_IF::skip_record() for filtering
+# - Also test access using prefix of primary key
+#
+# - Forget about cost model, BKA's multi_range_read_info() call passes 10 for
+# #rows, the call is there at all only for applicability check
+#
+-- source include/have_innodb.inc
+
+--disable_warnings
+drop table if exists t0,t1,t2,t3;
+--enable_warnings
+
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+
+--echo This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+
+--echo This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+# Try multi-column indexes
+create table t1(
+ a char(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+# Try with dataset that causes identical lookup keys:
+insert into t2 values ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+drop table t1, t2;
+
+create table t1(
+ a varchar(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+#
+# Try scanning on a CPK prefix
+#
+explain select * from t1, t2 where t1.a=t2.a;
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+#
+# The above example is not very interesting, as CPK prefix has
+# only one match. Create a dataset where scan on CPK prefix
+# would produce multiple matches:
+#
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+# Check a real resultset for comparison:
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+set join_cache_level=6;
+
+
+#
+# Check that Index Condition Pushdown (BKA) actually works:
+#
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+
+set optimizer_switch='index_condition_pushdown=off';
+explain select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+select * from t1, t2 where t1.a=t2.a and t2.b + t1.b > 100;
+set optimizer_switch='index_condition_pushdown=on';
+
+drop table t1,t2;
+
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
+
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test.moved maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test.moved
--- maria-5.3-dsmrr-for-cpk-clean/mysql-test/t/innodb_mrr_cpk.test.moved 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/mysql-test/t/innodb_mrr_cpk.test.moved 2010-06-22 19:23:18.000000000 +0400
@@ -0,0 +1,128 @@
+#
+# Tests for DS-MRR over clustered primary key. The only engine that supports
+# this is InnoDB/XtraDB.
+#
+# Basic idea about testing
+# - DS-MRR/CPK works only with BKA
+# - Should also test index condition pushdown
+# - Should also test whatever uses RANGE_SEQ_IF::skip_record() for filtering
+# - Also test access using prefix of primary key
+#
+# - Forget about cost model, BKA's multi_range_read_info() call passes 10 for
+# #rows, the call is there at all only for applicability check
+#
+-- source include/have_innodb.inc
+
+--disable_warnings
+drop table if exists t0,t1,t2,t3;
+--enable_warnings
+
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+
+--echo This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+
+--echo This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+# Try multi-column indexes
+create table t1(
+ a char(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+drop table t1, t2;
+
+create table t1(
+ a varchar(8) character set utf8, b int, filler char(100),
+ primary key(a,b)
+);
+
+insert into t1 select
+ concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+ 1000 + A.a + B.a*10 + C.a*100,
+ 'filler'
+from t0 A, t0 B, t0 C;
+
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+#
+# Try scanning on a CPK prefix
+#
+explain select * from t1, t2 where t1.a=t2.a;
+select * from t1, t2 where t1.a=t2.a;
+drop table t1, t2;
+
+#
+# The above example is not very interesting, as CPK prefix has
+# only one match. Create a dataset where scan on CPK prefix
+# would produce multiple matches:
+#
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+set join_cache_level=6;
+
+drop table t1,t2;
+
+#
+# Check that Index Condition Pushdown (BKA) actually works:
+#
+
+# TODO
+
+#
+# Check that record-check-func is done:
+#
+
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
+
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/r/innodb_mrr_cpk.result maria-5.3-dsmrr-for-cpk-noc/r/innodb_mrr_cpk.result
--- maria-5.3-dsmrr-for-cpk-clean/r/innodb_mrr_cpk.result 1970-01-01 03:00:00.000000000 +0300
+++ maria-5.3-dsmrr-for-cpk-noc/r/innodb_mrr_cpk.result 2010-06-22 19:23:14.000000000 +0400
@@ -0,0 +1,122 @@
+drop table if exists t0,t1,t2,t3;
+set @save_join_cache_level=@@join_cache_level;
+set join_cache_level=6;
+set @save_storage_engine=@@storage_engine;
+set storage_engine=innodb;
+create table t0(a int);
+insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
+create table t1(a char(8), b char(8), filler char(100), primary key(a));
+show create table t1;
+Table Create Table
+t1 CREATE TABLE `t1` (
+ `a` char(8) NOT NULL DEFAULT '',
+ `b` char(8) DEFAULT NULL,
+ `filler` char(100) DEFAULT NULL,
+ PRIMARY KEY (`a`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+concat('b-', 1000 + A.a + B.a*10 + C.a*100, '=B'),
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8));
+insert into t2 values ('a-1010=A'), ('a-1030=A'), ('a-1020=A');
+This should use join buffer:
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 8 test.t2.a 1 Using join buffer
+This output must be sorted by value of t1.a:
+select * from t1, t2 where t1.a=t2.a;
+a b filler a
+a-1010=A b-1010=B filler a-1010=A
+a-1020=A b-1020=B filler a-1020=A
+a-1030=A b-1030=B filler a-1030=A
+drop table t1, t2;
+create table t1(
+a char(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 28 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1(
+a varchar(8) character set utf8, b int, filler char(100),
+primary key(a,b)
+);
+insert into t1 select
+concat('a-', 1000 + A.a + B.a*10 + C.a*100, '=A'),
+1000 + A.a + B.a*10 + C.a*100,
+'filler'
+from t0 A, t0 B, t0 C;
+create table t2 (a char(8) character set utf8, b int);
+insert into t2 values ('a-1010=A', 1010), ('a-1030=A', 1030), ('a-1020=A', 1020);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 eq_ref PRIMARY PRIMARY 30 test.t2.a,test.t2.b 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+explain select * from t1, t2 where t1.a=t2.a;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 26 test.t2.a 1 Using index condition(BKA); Using join buffer
+select * from t1, t2 where t1.a=t2.a;
+a b filler a b
+a-1010=A 1010 filler a-1010=A 1010
+a-1020=A 1020 filler a-1020=A 1020
+a-1030=A 1030 filler a-1030=A 1030
+drop table t1, t2;
+create table t1 (a int, b int, c int, filler char(100), primary key(a,b,c));
+insert into t1 select A.a, B.a, C.a, 'filler' from t0 A, t0 B, t0 C;
+insert into t1 values (11, 11, 11, 'filler');
+insert into t1 values (11, 11, 12, 'filler');
+insert into t1 values (11, 11, 13, 'filler');
+insert into t1 values (11, 22, 1234, 'filler');
+insert into t1 values (11, 33, 124, 'filler');
+insert into t1 values (11, 33, 125, 'filler');
+create table t2 (a int, b int);
+insert into t2 values (11,33), (11,22), (11,11);
+explain select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+id select_type table type possible_keys key key_len ref rows Extra
+1 SIMPLE t2 ALL NULL NULL NULL NULL 3
+1 SIMPLE t1 ref PRIMARY PRIMARY 8 test.t2.a,test.t2.b 1 Using join buffer
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+11 22 1234 filler 11 22
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+set join_cache_level=0;
+select * from t1, t2 where t1.a=t2.a and t1.b=t2.b;
+a b c filler a b
+11 33 124 filler 11 33
+11 33 125 filler 11 33
+11 22 1234 filler 11 22
+11 11 11 filler 11 11
+11 11 12 filler 11 11
+11 11 13 filler 11 11
+set join_cache_level=6;
+drop table t1,t2;
+set @@join_cache_level= @save_join_cache_level;
+set storage_engine=@save_storage_engine;
+drop table t0;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/handler.h maria-5.3-dsmrr-for-cpk-noc/sql/handler.h
--- maria-5.3-dsmrr-for-cpk-clean/sql/handler.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/handler.h 2010-06-22 23:28:40.000000000 +0400
@@ -1168,9 +1168,9 @@
COST_VECT *cost);
/*
- The below two are not used (and not handled) in this milestone of this WL
- entry because there seems to be no use for them at this stage of
- implementation.
+ Indicates that all scanned ranges will be singlepoint (aka equality) ranges.
+ The ranges may not use the full key but all of them will use the same number
+ of key parts.
*/
#define HA_MRR_SINGLE_POINT 1
#define HA_MRR_FIXED_KEY 2
@@ -1752,9 +1752,10 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
virtual ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
virtual int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
- uint n_ranges, uint mode,
+ uint n_ranges, uint mode,
HANDLER_BUFFER *buf);
virtual int multi_range_read_next(char **range_info);
virtual int read_range_first(const key_range *start_key,
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.cc maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.cc 2010-06-22 23:28:40.000000000 +0400
@@ -1,4 +1,5 @@
#include "mysql_priv.h"
+#include <my_bit.h>
#include "sql_select.h"
/****************************************************************************
@@ -136,10 +137,16 @@
*/
ha_rows handler::multi_range_read_info(uint keyno, uint n_ranges, uint n_rows,
- uint *bufsz, uint *flags, COST_VECT *cost)
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost)
{
- *bufsz= 0; /* Default implementation doesn't need a buffer */
+ /*
+ Currently we expect this function to be called only in preparation of scan
+ with HA_MRR_SINGLE_POINT property.
+ */
+  DBUG_ASSERT(*flags & HA_MRR_SINGLE_POINT);
+ *bufsz= 0; /* Default implementation doesn't need a buffer */
*flags |= HA_MRR_USE_DEFAULT_IMPL;
cost->zero();
@@ -316,25 +323,39 @@
{
use_default_impl= TRUE;
const int retval=
- h->handler::multi_range_read_init(seq_funcs, seq_init_param,
- n_ranges, mode, buf);
+ h->handler::multi_range_read_init(seq_funcs, seq_init_param, n_ranges,
+ mode, buf);
DBUG_RETURN(retval);
}
- rowids_buf= buf->buffer;
+ mrr_buf= buf->buffer;
is_mrr_assoc= !test(mode & HA_MRR_NO_ASSOCIATION);
if (is_mrr_assoc)
status_var_increment(table->in_use->status_var.ha_multi_range_read_init_count);
- rowids_buf_end= buf->buffer_end;
+ mrr_buf_end= buf->buffer_end;
+
+ if ((doing_cpk_scan= check_cpk_scan(h->active_index, mode)))
+ {
+ /* It's a DS-MRR/CPK scan */
+ cpk_tuple_length= 0; /* dummy value telling it needs to be inited */
+ cpk_have_range= FALSE;
+ use_default_impl= FALSE;
+ h->mrr_iter= seq_funcs->init(seq_init_param, n_ranges, mode);
+ h->mrr_funcs= *seq_funcs;
+ dsmrr_fill_buffer_cpk();
+ if (dsmrr_eof)
+ buf->end_of_used_area= mrr_buf_last;
+ DBUG_RETURN(0); /* nothing could go wrong while filling the buffer */
+ }
+
+ /* In regular DS-MRR, buffer stores {rowid, range_id} pairs */
elem_size= h->ref_length + (int)is_mrr_assoc * sizeof(void*);
- rowids_buf_last= rowids_buf +
- ((rowids_buf_end - rowids_buf)/ elem_size)*
- elem_size;
- rowids_buf_end= rowids_buf_last;
+ mrr_buf_last= mrr_buf + ((mrr_buf_end - mrr_buf)/ elem_size)* elem_size;
+ mrr_buf_end= mrr_buf_last;
- /*
+ /*
There can be two cases:
- This is the first call since index_init(), h2==NULL
Need to setup h2 then.
@@ -406,8 +427,8 @@
goto error;
}
- if (h2->handler::multi_range_read_init(seq_funcs, seq_init_param, n_ranges,
- mode, buf) ||
+ if (h2->handler::multi_range_read_init(seq_funcs, seq_init_param, n_ranges,
+ mode, buf) ||
dsmrr_fill_buffer())
{
goto error;
@@ -417,7 +438,7 @@
adjust *buf to indicate that the remaining buffer space will not be used.
*/
if (dsmrr_eof)
- buf->end_of_used_area= rowids_buf_last;
+ buf->end_of_used_area= mrr_buf_last;
/*
h->inited == INDEX may occur when 'range checked for each record' is
@@ -473,6 +494,9 @@
rowid and return.
The function assumes that rowids buffer is empty when it is invoked.
+
+ dsmrr_eof is set to indicate whether we've exhausted the list of ranges we're
+ scanning.
@param h Table handler
@@ -487,8 +511,8 @@
int res;
DBUG_ENTER("DsMrr_impl::dsmrr_fill_buffer");
- rowids_buf_cur= rowids_buf;
- while ((rowids_buf_cur < rowids_buf_end) &&
+ mrr_buf_cur= mrr_buf;
+ while ((mrr_buf_cur < mrr_buf_end) &&
!(res= h2->handler::multi_range_read_next(&range_info)))
{
KEY_MULTI_RANGE *curr_range= &h2->handler::mrr_cur_range;
@@ -498,13 +522,13 @@
/* Put rowid, or {rowid, range_id} pair into the buffer */
h2->position(table->record[0]);
- memcpy(rowids_buf_cur, h2->ref, h2->ref_length);
- rowids_buf_cur += h2->ref_length;
+ memcpy(mrr_buf_cur, h2->ref, h2->ref_length);
+ mrr_buf_cur += h2->ref_length;
if (is_mrr_assoc)
{
- memcpy(rowids_buf_cur, &range_info, sizeof(void*));
- rowids_buf_cur += sizeof(void*);
+ memcpy(mrr_buf_cur, &range_info, sizeof(void*));
+ mrr_buf_cur += sizeof(void*);
}
}
@@ -514,16 +538,224 @@
/* Sort the buffer contents by rowid */
uint elem_size= h->ref_length + (int)is_mrr_assoc * sizeof(void*);
- uint n_rowids= (rowids_buf_cur - rowids_buf) / elem_size;
+ uint n_rowids= (mrr_buf_cur - mrr_buf) / elem_size;
- my_qsort2(rowids_buf, n_rowids, elem_size, (qsort2_cmp)rowid_cmp,
+ my_qsort2(mrr_buf, n_rowids, elem_size, (qsort2_cmp)rowid_cmp,
(void*)h);
- rowids_buf_last= rowids_buf_cur;
- rowids_buf_cur= rowids_buf;
+ mrr_buf_last= mrr_buf_cur;
+ mrr_buf_cur= mrr_buf;
DBUG_RETURN(0);
}
+/*
+ my_qsort2-compatible function to compare key tuples
+*/
+
+int DsMrr_impl::key_tuple_cmp(void* arg, uchar* key1, uchar* key2)
+{
+ DsMrr_impl *dsmrr= (DsMrr_impl*)arg;
+ TABLE *table= dsmrr->h->table;
+
+ KEY_PART_INFO *part= table->key_info[table->s->primary_key].key_part;
+ uchar *key1_end= key1 + dsmrr->cpk_tuple_length;
+
+ while (key1 < key1_end)
+ {
+ Field* f = part->field;
+ int len = part->store_length;
+ int res = f->cmp(key1, key2);
+ if (res)
+ return res;
+ key1 += len;
+ key2 += len;
+ part++;
+ }
+ return 0;
+}
+
+
+/*
+ DS-MRR/CPK: Fill the buffer with (lookup_tuple, range_id) pairs and sort
+
+ SYNOPSIS
+ DsMrr_impl::dsmrr_fill_buffer_cpk()
+
+ DESCRIPTION
+ DS-MRR/CPK: Fill the buffer with (lookup_tuple, range_id) pairs and sort
+
+ dsmrr_eof is set to indicate whether we've exhausted the list of ranges
+ we're scanning.
+*/
+
+void DsMrr_impl::dsmrr_fill_buffer_cpk()
+{
+ int res;
+ KEY_MULTI_RANGE cur_range;
+ DBUG_ENTER("DsMrr_impl::dsmrr_fill_buffer_cpk");
+
+ mrr_buf_cur= mrr_buf;
+ while ((mrr_buf_cur < mrr_buf_end) &&
+ !(res= h->mrr_funcs.next(h->mrr_iter, &cur_range)))
+ {
+ DBUG_ASSERT(cur_range.range_flag & EQ_RANGE);
+ DBUG_ASSERT(!cpk_tuple_length ||
+ cpk_tuple_length == cur_range.start_key.length);
+ if (!cpk_tuple_length)
+ {
+ cpk_tuple_length= cur_range.start_key.length;
+ cpk_is_unique_scan= test(table->key_info[h->active_index].key_parts ==
+ my_count_bits(cur_range.start_key.keypart_map));
+ uint elem_size= cpk_tuple_length + (int)is_mrr_assoc * sizeof(void*);
+ mrr_buf_last= mrr_buf + ((mrr_buf_end - mrr_buf)/elem_size) * elem_size;
+ mrr_buf_end= mrr_buf_last;
+ }
+
+ /* Put key, or {key, range_id} pair into the buffer */
+ memcpy(mrr_buf_cur, cur_range.start_key.key, cpk_tuple_length);
+ mrr_buf_cur += cpk_tuple_length;
+
+ if (is_mrr_assoc)
+ {
+ memcpy(mrr_buf_cur, &cur_range.ptr, sizeof(void*));
+ mrr_buf_cur += sizeof(void*);
+ }
+ }
+
+ dsmrr_eof= test(res);
+
+ /* Sort the buffer contents by rowid */
+ uint elem_size= cpk_tuple_length + (int)is_mrr_assoc * sizeof(void*);
+ uint n_rowids= (mrr_buf_cur - mrr_buf) / elem_size;
+
+ my_qsort2(mrr_buf, n_rowids, elem_size,
+ (qsort2_cmp)DsMrr_impl::key_tuple_cmp, (void*)this);
+ mrr_buf_last= mrr_buf_cur;
+ mrr_buf_cur= mrr_buf;
+ DBUG_VOID_RETURN;
+}
+
+
+/*
+ DS-MRR/CPK: multi_range_read_next() function
+
+  SYNOPSIS
+ DsMrr_impl::dsmrr_next_cpk()
+ range_info OUT identifier of range that the returned record belongs to
+
+ DESCRIPTION
+ DS-MRR/CPK: multi_range_read_next() function.
+ This is similar to DsMrr_impl::dsmrr_next(), the differences are that
+ - we get records with index_read(), not with rnd_pos()
+ - we may get multiple records for one key (=element of the buffer)
+ - unlike dsmrr_fill_buffer(), dsmrr_fill_buffer_cpk() never fails.
+
+ RETURN
+ 0 OK, next record was successfully read
+ HA_ERR_END_OF_FILE End of records
+ Other Some other error
+*/
+
+int DsMrr_impl::dsmrr_next_cpk(char **range_info)
+{
+ int res;
+
+ while (cpk_have_range)
+ {
+
+ if (h->mrr_funcs.skip_record &&
+ h->mrr_funcs.skip_record(h->mrr_iter, cpk_saved_range_info, NULL))
+ {
+ cpk_have_range= FALSE;
+ break;
+ }
+
+ res= h->index_next_same(table->record[0], mrr_buf_cur, cpk_tuple_length);
+
+ if (h->mrr_funcs.skip_index_tuple &&
+ h->mrr_funcs.skip_index_tuple(h->mrr_iter, cpk_saved_range_info))
+ continue;
+
+ if (res != HA_ERR_END_OF_FILE)
+ {
+ if (is_mrr_assoc)
+ memcpy(range_info, &cpk_saved_range_info, sizeof(void*));
+ return res;
+ }
+
+ /* No more records in this range. Exit this loop and go get another range */
+ cpk_have_range= FALSE;
+ }
+
+ do
+ {
+ /* First, make sure we have a range at start of the buffer */
+ if (mrr_buf_cur == mrr_buf_last)
+ {
+ if (dsmrr_eof)
+ {
+ res= HA_ERR_END_OF_FILE;
+ goto end;
+ }
+ dsmrr_fill_buffer_cpk();
+ }
+ if (mrr_buf_cur == mrr_buf_last)
+ {
+ res= HA_ERR_END_OF_FILE;
+ goto end;
+ }
+
+ /* Ok, got the range. Try making a lookup. */
+ uchar *lookup_tuple= mrr_buf_cur;
+ mrr_buf_cur += cpk_tuple_length;
+ if (is_mrr_assoc)
+ {
+ memcpy(&cpk_saved_range_info, mrr_buf_cur, sizeof(void*));
+ mrr_buf_cur += sizeof(void*) * test(is_mrr_assoc);
+ }
+
+ if (h->mrr_funcs.skip_record &&
+ h->mrr_funcs.skip_record(h->mrr_iter, cpk_saved_range_info, NULL))
+ continue;
+
+ res= h->index_read(table->record[0], lookup_tuple, cpk_tuple_length,
+ HA_READ_KEY_EXACT);
+
+ /*
+ Check pushed index condition. Performance-wise, it does not make any
+ sense to put this call here (the above call has already accessed the full
+ record). That's the best I could do, though, because:
+ - ha_innobase doesn't support IndexConditionPushdown on clustered PK
+ - MRR interface doesn't allow the storage engine to refuse a pushed index
+ condition.
+ Having this call here is not fully harmless: EXPLAIN shows "pushed index
+ condition", which is technically true but doesn't bring the benefits that
+ one might expect.
+ */
+ if (h->mrr_funcs.skip_index_tuple &&
+ h->mrr_funcs.skip_index_tuple(h->mrr_iter, cpk_saved_range_info))
+ continue;
+
+ if (res && res != HA_ERR_END_OF_FILE)
+ goto end;
+
+ if (!res)
+ {
+ memcpy(range_info, &cpk_saved_range_info, sizeof(void*));
+ /*
+ Attempt reading more rows from this range only if there actually can
+ be multiple matches:
+ */
+ cpk_have_range= !cpk_is_unique_scan;
+ break;
+ }
+ } while (true);
+
+end:
+ return res;
+}
+
+
/**
DS-MRR implementation: multi_range_read_next() function
*/
@@ -536,10 +768,13 @@
if (use_default_impl)
return h->handler::multi_range_read_next(range_info);
+
+ if (doing_cpk_scan)
+ return dsmrr_next_cpk(range_info);
do
{
- if (rowids_buf_cur == rowids_buf_last)
+ if (mrr_buf_cur == mrr_buf_last)
{
if (dsmrr_eof)
{
@@ -552,17 +787,17 @@
}
/* return eof if there are no rowids in the buffer after re-fill attempt */
- if (rowids_buf_cur == rowids_buf_last)
+ if (mrr_buf_cur == mrr_buf_last)
{
res= HA_ERR_END_OF_FILE;
goto end;
}
- rowid= rowids_buf_cur;
+ rowid= mrr_buf_cur;
if (is_mrr_assoc)
- memcpy(&cur_range_info, rowids_buf_cur + h->ref_length, sizeof(uchar**));
+ memcpy(&cur_range_info, mrr_buf_cur + h->ref_length, sizeof(uchar**));
- rowids_buf_cur += h->ref_length + sizeof(void*) * test(is_mrr_assoc);
+ mrr_buf_cur += h->ref_length + sizeof(void*) * test(is_mrr_assoc);
if (h2->mrr_funcs.skip_record &&
h2->mrr_funcs.skip_record(h2->mrr_iter, (char *) cur_range_info, rowid))
continue;
@@ -582,7 +817,8 @@
/**
DS-MRR implementation: multi_range_read_info() function
*/
-ha_rows DsMrr_impl::dsmrr_info(uint keyno, uint n_ranges, uint rows,
+ha_rows DsMrr_impl::dsmrr_info(uint keyno, uint n_ranges, uint rows,
+ uint key_parts,
uint *bufsz, uint *flags, COST_VECT *cost)
{
ha_rows res;
@@ -590,8 +826,8 @@
uint def_bufsz= *bufsz;
/* Get cost/flags/mem_usage of default MRR implementation */
- res= h->handler::multi_range_read_info(keyno, n_ranges, rows, &def_bufsz,
- &def_flags, cost);
+ res= h->handler::multi_range_read_info(keyno, n_ranges, rows, key_parts,
+ &def_bufsz, &def_flags, cost);
DBUG_ASSERT(!res);
if ((*flags & HA_MRR_USE_DEFAULT_IMPL) ||
@@ -683,7 +919,33 @@
return FALSE;
}
-/**
+
+/*
+ Check if key/flags allow DS-MRR/CPK strategy to be used
+
+ SYNOPSIS
+ DsMrr_impl::check_cpk_scan()
+ keyno Index that will be used
+ mrr_flags
+
+ DESCRIPTION
+ Check if key/flags allow DS-MRR/CPK strategy to be used.
+
+ RETURN
+ TRUE DS-MRR/CPK should be used
+ FALSE Otherwise
+*/
+
+bool DsMrr_impl::check_cpk_scan(uint keyno, uint mrr_flags)
+{
+ return test((mrr_flags & HA_MRR_SINGLE_POINT) &&
+ !(mrr_flags & HA_MRR_SORTED) &&
+ keyno == table->s->primary_key &&
+ h->primary_key_is_clustered());
+}
+
+
+/*
DS-MRR Internals: Choose between Default MRR implementation and DS-MRR
Make the choice between using Default MRR implementation and DS-MRR.
@@ -706,14 +968,18 @@
@retval FALSE DS-MRR implementation should be used
*/
+
bool DsMrr_impl::choose_mrr_impl(uint keyno, ha_rows rows, uint *flags,
uint *bufsz, COST_VECT *cost)
{
COST_VECT dsmrr_cost;
bool res;
THD *thd= current_thd;
+
+ doing_cpk_scan= check_cpk_scan(keyno, *flags);
if (thd->variables.optimizer_use_mrr == 2 || *flags & HA_MRR_INDEX_ONLY ||
- (keyno == table->s->primary_key && h->primary_key_is_clustered()) ||
+ (keyno == table->s->primary_key && h->primary_key_is_clustered() &&
+ !doing_cpk_scan) ||
key_uses_partial_cols(table, keyno))
{
/* Use the default implementation */
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.h maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.h
--- maria-5.3-dsmrr-for-cpk-clean/sql/multi_range_read.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/multi_range_read.h 2010-06-22 23:28:40.000000000 +0400
@@ -1,16 +1,76 @@
/*
- This file contains declarations for
- - Disk-Sweep MultiRangeRead (DS-MRR) implementation
+ This file contains declarations for Disk-Sweep MultiRangeRead (DS-MRR)
+ implementation
*/
/**
- A Disk-Sweep MRR interface implementation
+ A Disk-Sweep implementation of MRR Interface (DS-MRR for short)
- This implementation makes range (and, in the future, 'ref') scans to read
- table rows in disk sweeps.
-
- Currently it is used by MyISAM and InnoDB. Potentially it can be used with
- any table handler that has non-clustered indexes and on-disk rows.
+  This is a "plugin"(*) for storage engines that allows index scans to
+ read table rows in rowid order. For disk-based storage engines, this is
+ faster than reading table rows in whatever-SQL-layer-makes-calls-in order.
+
+ (*) - only conceptually. No dynamic loading or binary compatibility of any
+ kind.
+
+ General scheme of things:
+
+ SQL Layer code
+ | | |
+ -v---v---v---- handler->multi_range_read_XXX() function calls
+ | | |
+ ____________________________________
+ / DS-MRR module \
+ | (scan indexes, order rowids, do |
+ | full record reads in rowid order) |
+ \____________________________________/
+ | | |
+ -|---|---|----- handler->read_range_first()/read_range_next(),
+ | | | handler->index_read(), handler->rnd_pos() calls.
+ | | |
+ v v v
+ Storage engine internals
+
+ Currently DS-MRR is used by MyISAM, InnoDB/XtraDB and Maria storage engines.
+ Potentially it can be used with any table handler that has disk-based data
+ storage and has better performance when reading data in rowid order.
+*/
+
+
+/*
+ DS-MRR implementation for one table. Create/use one object of this class for
+ each ha_{myisam/innobase/etc} object. That object will be further referred to
+ as "the handler"
+
+ There are actually three strategies
+ S1. Bypass DS-MRR, pass all calls to default implementation (i.e. to
+ MRR-to-non-MRR calls converter)
+ S2. Regular DS-MRR
+ S3. DS-MRR/CPK for doing scans on clustered primary keys.
+
+ S1 is used for cases which DS-MRR is unable to handle for some reason.
+
+ S2 is the actual DS-MRR. The basic algorithm is as follows:
+ 1. Scan the index (and only index, that is, with HA_EXTRA_KEYREAD on) and
+ fill the buffer with {rowid, range_id} pairs
+ 2. Sort the buffer by rowid
+ 3. for each {rowid, range_id} pair in the buffer
+ get record by rowid and return the {record, range_id} pair
+ 4. Repeat the above steps until we've exhausted the list of ranges we're
+ scanning.
+
+ S3 is the variant of DS-MRR for use with clustered primary keys (or any
+ clustered index). The idea is that in clustered index it is sufficient to
+  access the index in index order, and we don't need intermediate steps to
+ get rowid (like step #1 in S2).
+
+ DS-MRR/CPK's basic algorithm is as follows:
+ 1. Collect a number of ranges (=lookup keys)
+ 2. Sort them so that they follow in index order.
+ 3. for each {lookup_key, range_id} pair in the buffer
+ get record(s) matching the lookup key and return {record, range_id} pairs
+ 4. Repeat the above steps until we've exhausted the list of ranges we're
+ scanning.
*/
class DsMrr_impl
@@ -21,21 +81,38 @@
DsMrr_impl()
: h2(NULL) {};
+ void init(handler *h_arg, TABLE *table_arg)
+ {
+ h= h_arg;
+ table= table_arg;
+ }
+ int dsmrr_init(handler *h, RANGE_SEQ_IF *seq_funcs, void *seq_init_param,
+ uint n_ranges, uint mode, HANDLER_BUFFER *buf);
+ void dsmrr_close();
+ int dsmrr_next(char **range_info);
+
+ ha_rows dsmrr_info(uint keyno, uint n_ranges, uint keys, uint key_parts,
+ uint *bufsz, uint *flags, COST_VECT *cost);
+
+ ha_rows dsmrr_info_const(uint keyno, RANGE_SEQ_IF *seq,
+ void *seq_init_param, uint n_ranges, uint *bufsz,
+ uint *flags, COST_VECT *cost);
+private:
/*
The "owner" handler object (the one that calls dsmrr_XXX functions.
It is used to retrieve full table rows by calling rnd_pos().
*/
handler *h;
TABLE *table; /* Always equal to h->table */
-private:
+
/* Secondary handler object. It is used for scanning the index */
handler *h2;
/* Buffer to store rowids, or (rowid, range_id) pairs */
- uchar *rowids_buf;
- uchar *rowids_buf_cur; /* Current position when reading/writing */
- uchar *rowids_buf_last; /* When reading: end of used buffer space */
- uchar *rowids_buf_end; /* End of the buffer */
+ uchar *mrr_buf;
+ uchar *mrr_buf_cur; /* Current position when reading/writing */
+ uchar *mrr_buf_last; /* When reading: end of used buffer space */
+ uchar *mrr_buf_end; /* End of the buffer */
bool dsmrr_eof; /* TRUE <=> We have reached EOF when reading index tuples */
@@ -43,28 +120,31 @@
bool is_mrr_assoc;
bool use_default_impl; /* TRUE <=> shortcut all calls to default MRR impl */
-public:
- void init(handler *h_arg, TABLE *table_arg)
- {
- h= h_arg;
- table= table_arg;
- }
- int dsmrr_init(handler *h, RANGE_SEQ_IF *seq_funcs, void *seq_init_param,
- uint n_ranges, uint mode, HANDLER_BUFFER *buf);
- void dsmrr_close();
- int dsmrr_fill_buffer();
- int dsmrr_next(char **range_info);
- ha_rows dsmrr_info(uint keyno, uint n_ranges, uint keys, uint *bufsz,
- uint *flags, COST_VECT *cost);
+ bool doing_cpk_scan; /* TRUE <=> DS-MRR/CPK variant is used */
+
+ /** DS-MRR/CPK variables start */
+
+ /* Length of lookup tuple being used, in bytes */
+ uint cpk_tuple_length;
+ /*
+ TRUE <=> We're scanning on a full primary key (and not on prefix), and so
+ can get max. one match for each key
+ */
+ bool cpk_is_unique_scan;
+ /* TRUE<=> we're in a middle of enumerating records from a range */
+ bool cpk_have_range;
+ /* Valid if cpk_have_range==TRUE: range_id of the range we're enumerating */
+ char *cpk_saved_range_info;
- ha_rows dsmrr_info_const(uint keyno, RANGE_SEQ_IF *seq,
- void *seq_init_param, uint n_ranges, uint *bufsz,
- uint *flags, COST_VECT *cost);
-private:
bool choose_mrr_impl(uint keyno, ha_rows rows, uint *flags, uint *bufsz,
COST_VECT *cost);
bool get_disk_sweep_mrr_cost(uint keynr, ha_rows rows, uint flags,
uint *buffer_size, COST_VECT *cost);
+ bool check_cpk_scan(uint keyno, uint mrr_flags);
+ static int key_tuple_cmp(void* arg, uchar* key1, uchar* key2);
+ int dsmrr_fill_buffer();
+ void dsmrr_fill_buffer_cpk();
+ int dsmrr_next_cpk(char **range_info);
};
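The S2/S3 strategies described in the multi_range_read.h comment above reduce to a batch-sort-fetch loop. The following is a minimal standalone C++ sketch of that loop, for illustration only; it is not the DsMrr_impl code, and the Rowid, RangeId, BufferedKey, next_index_entry, fetch_row and emit names are invented here.

#include <algorithm>
#include <vector>

/* Stand-in types invented for this sketch; they are not MariaDB APIs. */
typedef std::vector<unsigned char> Rowid;   /* opaque row position (h->ref) */
typedef void *RangeId;                      /* caller's per-range cookie    */

struct BufferedKey { Rowid rowid; RangeId range_id; };

static bool rowid_less(const BufferedKey &a, const BufferedKey &b)
{ return a.rowid < b.rowid; }               /* lexicographic rowid order    */

/*
  S2 in miniature: scan the index and collect {rowid, range_id} pairs,
  sort the batch by rowid, then read full rows in rowid (disk) order.
  Repeat until the index scan is exhausted.
*/
template <class NextIndexEntry, class FetchRow, class Emit>
void ds_mrr_sweep(NextIndexEntry next_index_entry, /* false at end of scan   */
                  FetchRow fetch_row,              /* full record by rowid   */
                  Emit emit,                       /* {record, range_id} out */
                  size_t buffer_elems)             /* pairs that fit in buf  */
{
  bool eof= false;
  while (!eof)
  {
    std::vector<BufferedKey> buf;
    BufferedKey k;
    while (buf.size() < buffer_elems && !(eof= !next_index_entry(k)))
      buf.push_back(k);                            /* step 1: fill buffer    */
    std::sort(buf.begin(), buf.end(), rowid_less); /* step 2: order by rowid */
    for (size_t i= 0; i < buf.size(); i++)
      emit(fetch_row(buf[i].rowid), buf[i].range_id); /* step 3: sweep reads */
  }
}

In the S3 (CPK) variant the buffer holds lookup key tuples instead of rowids, and the fetch step uses index_read()/index_next_same() rather than rnd_pos(), which is what dsmrr_fill_buffer_cpk() and dsmrr_next_cpk() above implement.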
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/opt_range.cc maria-5.3-dsmrr-for-cpk-noc/sql/opt_range.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/opt_range.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/opt_range.cc 2010-06-22 23:28:40.000000000 +0400
@@ -8006,6 +8006,7 @@
quick->mrr_buf_size= thd->variables.mrr_buff_size;
if (table->file->multi_range_read_info(quick->index, 1, (uint)records,
+ uint(-1),
&quick->mrr_buf_size,
&quick->mrr_flags, &cost))
goto err;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/sql_join_cache.cc maria-5.3-dsmrr-for-cpk-noc/sql/sql_join_cache.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/sql_join_cache.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/sql_join_cache.cc 2010-06-22 23:28:40.000000000 +0400
@@ -2376,8 +2376,8 @@
*/
if (!file->inited)
file->ha_index_init(join_tab->ref.key, 1);
- if ((error= file->multi_range_read_init(seq_funcs, (void*) this, ranges,
- mrr_mode, &mrr_buff)))
+ if ((error= file->multi_range_read_init(seq_funcs, (void*) this, ranges,
+ mrr_mode, &mrr_buff)))
rc= error < 0 ? NESTED_LOOP_NO_MORE_ROWS: NESTED_LOOP_ERROR;
return rc;
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/sql/sql_select.cc maria-5.3-dsmrr-for-cpk-noc/sql/sql_select.cc
--- maria-5.3-dsmrr-for-cpk-clean/sql/sql_select.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/sql/sql_select.cc 2010-06-22 19:06:54.000000000 +0400
@@ -7318,10 +7318,11 @@
case JT_EQ_REF:
if (cache_level <= 4)
return 0;
- flags= HA_MRR_NO_NULL_ENDPOINTS;
+ flags= HA_MRR_NO_NULL_ENDPOINTS | HA_MRR_SINGLE_POINT;
if (tab->table->covering_keys.is_set(tab->ref.key))
flags|= HA_MRR_INDEX_ONLY;
rows= tab->table->file->multi_range_read_info(tab->ref.key, 10, 20,
+ tab->ref.key_parts,
&bufsz, &flags, &cost);
if ((rows != HA_POS_ERROR) && !(flags & HA_MRR_USE_DEFAULT_IMPL) &&
(!(flags & HA_MRR_NO_ASSOCIATION) || cache_level > 6) &&
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.cc maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.cc
--- maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.cc 2010-06-22 23:28:40.000000000 +0400
@@ -3501,8 +3501,8 @@
***************************************************************************/
int ha_maria::multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
- uint n_ranges, uint mode,
- HANDLER_BUFFER *buf)
+ uint n_ranges, uint mode,
+ HANDLER_BUFFER *buf)
{
return ds_mrr.dsmrr_init(this, seq, seq_init_param, n_ranges, mode, buf);
}
@@ -3528,11 +3528,11 @@
}
ha_rows ha_maria::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags,
- COST_VECT *cost)
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost)
{
ds_mrr.init(this, table);
- return ds_mrr.dsmrr_info(keyno, n_ranges, keys, bufsz, flags, cost);
+ return ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz, flags, cost);
}
/* MyISAM MRR implementation ends */
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.h maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.h
--- maria-5.3-dsmrr-for-cpk-clean/storage/maria/ha_maria.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/maria/ha_maria.h 2010-06-22 23:28:40.000000000 +0400
@@ -181,7 +181,8 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
/* Index condition pushdown implementation */
Item *idx_cond_push(uint keyno, Item* idx_cond);
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.cc maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.cc
--- maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.cc 2010-06-22 23:28:40.000000000 +0400
@@ -2244,11 +2244,11 @@
}
ha_rows ha_myisam::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags,
- COST_VECT *cost)
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost)
{
ds_mrr.init(this, table);
- return ds_mrr.dsmrr_info(keyno, n_ranges, keys, bufsz, flags, cost);
+ return ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz, flags, cost);
}
/* MyISAM MRR implementation ends */
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.h maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.h
--- maria-5.3-dsmrr-for-cpk-clean/storage/myisam/ha_myisam.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/myisam/ha_myisam.h 2010-06-22 23:28:40.000000000 +0400
@@ -169,7 +169,8 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
/* Index condition pushdown implementation */
Item *idx_cond_push(uint keyno, Item* idx_cond);
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.cc maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.cc
--- maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.cc 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.cc 2010-06-22 23:28:40.000000000 +0400
@@ -11025,7 +11025,8 @@
*/
int ha_innobase::multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
- uint n_ranges, uint mode, HANDLER_BUFFER *buf)
+ uint n_ranges, uint mode,
+ HANDLER_BUFFER *buf)
{
return ds_mrr.dsmrr_init(this, seq, seq_init_param, n_ranges, mode, buf);
}
@@ -11052,12 +11053,13 @@
return res;
}
-ha_rows ha_innobase::multi_range_read_info(uint keyno, uint n_ranges,
- uint keys, uint *bufsz,
+ha_rows ha_innobase::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
+ uint key_parts, uint *bufsz,
uint *flags, COST_VECT *cost)
{
ds_mrr.init(this, table);
- ha_rows res= ds_mrr.dsmrr_info(keyno, n_ranges, keys, bufsz, flags, cost);
+ ha_rows res= ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz,
+ flags, cost);
return res;
}
diff -urN --exclude='.*' maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.h maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.h
--- maria-5.3-dsmrr-for-cpk-clean/storage/xtradb/handler/ha_innodb.h 2010-06-22 19:10:46.000000000 +0400
+++ maria-5.3-dsmrr-for-cpk-noc/storage/xtradb/handler/ha_innodb.h 2010-06-22 23:28:40.000000000 +0400
@@ -217,7 +217,8 @@
uint n_ranges, uint *bufsz,
uint *flags, COST_VECT *cost);
ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
- uint *bufsz, uint *flags, COST_VECT *cost);
+ uint key_parts, uint *bufsz,
+ uint *flags, COST_VECT *cost);
DsMrr_impl ds_mrr;
Item *idx_cond_push(uint keyno, Item* idx_cond);
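To summarize the interface change that repeats across the engine hunks above: every multi_range_read_info() override gains a key_parts argument and forwards it to DsMrr_impl::dsmrr_info(). A condensed sketch of that pattern follows; ha_example is a placeholder name, not a class in this patch.

ha_rows ha_example::multi_range_read_info(uint keyno, uint n_ranges, uint keys,
                                          uint key_parts, uint *bufsz,
                                          uint *flags, COST_VECT *cost)
{
  /* Bind the DS-MRR helper to this handler/table, then delegate. */
  ds_mrr.init(this, table);
  return ds_mrr.dsmrr_info(keyno, n_ranges, keys, key_parts, bufsz, flags, cost);
}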
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
1
0
[Maria-developers] Progress (by Knielsen): New replication APIs (107)
by worklog-noreply@askmonty.org 21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: New replication APIs
CREATION DATE..: Mon, 15 Mar 2010, 13:55
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: Sergei
CATEGORY.......: Server-Sprint
TASK ID........: 107 (http://askmonty.org/worklog/?tid=107)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 69
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:36)=-=-
Research and design thoughts.
Worked 19 hours and estimate 0 hours remain (original estimate increased by 19 hours).
-=-=(Knielsen - Mon, 07 Jun 2010, 12:11)=-=-
High Level Description modified.
--- /tmp/wklog.107.old.31097 2010-06-07 12:11:57.000000000 +0000
+++ /tmp/wklog.107.new.31097 2010-06-07 12:11:57.000000000 +0000
@@ -7,3 +7,6 @@
https://lists.launchpad.net/maria-developers/msg01998.html
+Wiki page for the project:
+
+ http://askmonty.org/wiki/ReplicationProject
-=-=(Knielsen - Mon, 29 Mar 2010, 07:33)=-=-
Research and design discussions: Galera, 2pc/XA, group commit, multi-engine transactions.
Worked 14 hours and estimate 0 hours remain (original estimate increased by 14 hours).
-=-=(Knielsen - Wed, 24 Mar 2010, 10:39)=-=-
Design discussions
Worked 11 hours and estimate 0 hours remain (original estimate increased by 11 hours).
-=-=(Knielsen - Mon, 15 Mar 2010, 14:28)=-=-
Research into the problem, and discussions on phone/mailing list
Worked 25 hours and estimate 0 hours remain (original estimate increased by 25 hours).
-=-=(Guest - Mon, 15 Mar 2010, 14:18)=-=-
High-Level Specification modified.
--- /tmp/wklog.107.old.9086 2010-03-15 14:18:18.000000000 +0000
+++ /tmp/wklog.107.new.9086 2010-03-15 14:18:18.000000000 +0000
@@ -1 +1,43 @@
+Current ideas/status after discussions on the mailing list:
+
+ - Implement a set of plugin APIs and use them to move all of the existing
+ MySQL replication into a (set of) plugins.
+
+ - Design the APIs so that they can support full MySQL replication, but also
+ so that they do not hardcode assumptions about how this replication
+ implementation is done, and so that they will be suitable for other types of
+ replication (Tungsten, Galera, parallel replication, ...).
+
+ - APIs need to include the concept of a global transaction ID. Need to
+ determine the extent to which the semantics of such ID will be defined
+ by the API, and to which extend it will be defined by the plugin
+ implementations.
+
+ - APIs should properly support reliable crash-recovery with decent
+ performance (eg. not require multiple mandatory fsync()s per commit, and
+ not make group commit impossible).
+
+ - Would be nice if the API provided facilities for implementing good
+ consistency checking support (mainly checking master tables against slave
+ tables is hard here I think, but also applying wrong binlog data and
+ individual event checksums).
+
+
+Steps to make this more concrete:
+
+ - Investigate the current MySQL replication, and list all of the places where
+ a plugin implementation will need to connect/hook into the MySQL server.
+ * handler::{write,update,delete}_row()
+ * Statement execution
+ * Transaction start/commit
+ * Table open
+ * Query safe/not/safe for statement based replication
+ * Statement-based logging details (user variables, random seed, etc.)
+ * ...
+
+ - Use this list to make an initial sketch of the set of APIs we need.
+
+ - Use the list to determine the feasibility of this project and the level of
+ detail in the API needed to support a full replication implementation as a
+ plugin.
-=-=(Serg - Mon, 15 Mar 2010, 14:13)=-=-
Observers changed: Sergei
DESCRIPTION:
This is a top-level task for the project of designing a new set of replication
APIs for MariaDB.
This task is for the initial discussion of what to do and where to focus.
The project is started in this email thread:
https://lists.launchpad.net/maria-developers/msg01998.html
Wiki page for the project:
http://askmonty.org/wiki/ReplicationProject
HIGH-LEVEL SPECIFICATION:
Current ideas/status after discussions on the mailing list:
- Implement a set of plugin APIs and use them to move all of the existing
MySQL replication into a (set of) plugins.
- Design the APIs so that they can support full MySQL replication, but also
so that they do not hardcode assumptions about how this replication
implementation is done, and so that they will be suitable for other types of
replication (Tungsten, Galera, parallel replication, ...).
- APIs need to include the concept of a global transaction ID. Need to
determine the extent to which the semantics of such ID will be defined
by the API, and to what extent it will be defined by the plugin
implementations.
- APIs should properly support reliable crash-recovery with decent
performance (eg. not require multiple mandatory fsync()s per commit, and
not make group commit impossible).
- Would be nice if the API provided facilities for implementing good
consistency checking support (mainly checking master tables against slave
tables is hard here I think, but also applying wrong binlog data and
individual event checksums).
Steps to make this more concrete:
- Investigate the current MySQL replication, and list all of the places where
a plugin implementation will need to connect/hook into the MySQL server.
* handler::{write,update,delete}_row()
* Statement execution
* Transaction start/commit
* Table open
* Query safe/not safe for statement-based replication
* Statement-based logging details (user variables, random seed, etc.)
* ...
- Use this list to make an initial sketch of the set of APIs we need.
- Use the list to determine the feasibility of this project and the level of
detail in the API needed to support a full replication implementation as a
plugin.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Replication API for stacked event generators (120)
by worklog-noreply@askmonty.org 21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.
Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).
DESCRIPTION:
A part of the replication project, MWL#107.
Events are produced by event Generators. Examples are
- Generation of statement-based replication events
- Generation of row-based events
- Generation of PBXT engine-level replication events
Reading events from the relay log on a slave may also be an example of
generating events.
Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.
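A small interface sketch may make the stacking idea concrete. This is illustrative C++ only, not an existing or proposed MariaDB API; every name in it (EventGenerator, Change, Event, try_generate) is invented here.

#include <string>

/* All types below are invented for illustration. */
struct Change { bool is_ddl; std::string statement; /* plus row images, etc. */ };
struct Event  { std::string payload; };

/* A generator either produces an event itself or defers to the next one. */
class EventGenerator
{
public:
  EventGenerator(EventGenerator *next= NULL) : next_(next) {}
  virtual ~EventGenerator() {}

  bool generate(const Change &c, Event *out)
  {
    if (try_generate(c, out))
      return true;                                 /* handled at this level   */
    return next_ ? next_->generate(c, out) : false; /* defer down the stack   */
  }

protected:
  virtual bool try_generate(const Change &c, Event *out)= 0;

private:
  EventGenerator *next_;
};

/* Statement-based generation can represent any change as SQL text. */
class StatementGenerator : public EventGenerator
{
protected:
  virtual bool try_generate(const Change &c, Event *out)
  { out->payload= c.statement; return true; }
};

/* Row-based generation handles DML itself but defers DDL. */
class RowGenerator : public EventGenerator
{
public:
  RowGenerator(EventGenerator *next) : EventGenerator(next) {}
protected:
  virtual bool try_generate(const Change &c, Event *out)
  {
    if (c.is_ddl)
      return false;                      /* let the statement generator do it */
    out->payload= "row images for: " + c.statement;
    return true;
  }
};

A row-based setup would then stack a RowGenerator on top of a StatementGenerator, matching the DDL-deferral example above.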
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 42
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
------------------------------------------------------------
-=-=(View All Progress Notes, 35 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially inserts of big blobs would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In the binary log, the Annotate_rows event follows the (possible) 'BEGIN' Query
event and precedes the first of the Table map events which accompany the
corresponding rows events. (See the example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. The session value allows annotating only selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events received from the master
in its own binary log (sensible only in combination with the log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not make
the slave reproduce these events whenever its binlog-annotate-rows-events
global value is ON? Because, for example, we may want to configure a
slave which reproduces Annotate_rows events but keeps the global
binlog-annotate-rows-events = OFF, i.e. OFF remains the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in the LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log contains them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
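Example invocation (the binlog file name here is arbitrary):
  mysqlbinlog --print-annotate-rows-events master-bin.000001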
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event number conflicts with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of the 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log.) Here is how the Format description event looks with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in the Log_event object from which all log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls the virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonyms in replication terminology).
In our case, the data header is empty and the data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
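Putting the above members together, the class declaration could look roughly
like this (a sketch in the same shorthand as the snippets above, not the final
code; base-class details and server/client #ifdef splits are omitted):
  class Annotate_rows_log_event: public Log_event
  {
  public:
    Annotate_rows_log_event(THD *thd);                 // write side: query taken from thd
    Annotate_rows_log_event(char *buf, uint event_len,
                            Format_description_log_event *desc); // read side
    Log_event_type get_type_code() { return ANNOTATE_ROWS_EVENT; }
    bool write_data_header(IO_CACHE *file) { return 0; } // empty post-header
    bool write_data_body(IO_CACHE *file);                // writes the query text
    void print(FILE *file, PRINT_EVENT_INFO *pinfo);
    int do_apply_event(...);                             // sets thd->query on the slave
  private:
    char *m_query_txt;
    uint m_query_len;
  };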
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede the corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once, before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate event at most once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command instead of reusing COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
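For completeness, the client side could look roughly as follows; the helper name
is ours, and whether the option byte reuses the BINLOG_SEND_ANNOTATE_ROWS_EVENT
bit is an assumption, not something fixed by this spec:
  // issued by the IO thread (or mysqlbinlog) just before COM_BINLOG_DUMP,
  // using the usual simple_command() client call
  static int request_dump_options_ext(MYSQL *mysql, bool annotate)
  {
    uchar buf[1];
    buf[0]= annotate ? BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
    return simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, buf, 1, 0);
  }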
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function, which in turn calls
the virtual do_apply_event() member specific to each type of
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the current
thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and
restored when the Annotate_rows_log_event object is deleted.
Is that really needed?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We can keep a pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
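The accessors can be trivial; a possible sketch (whether free_annotate_event()
should also restore thd->query ties back to the NOTE above):
  void Relay_log_info::set_annotate_event(Annotate_rows_log_event *ev)
  {
    free_annotate_event();            // drop a stale event, if any
    m_annotate_event= ev;
  }
  Annotate_rows_log_event* Relay_log_info::get_annotate_event()
  {
    return m_annotate_event;
  }
  void Relay_log_info::free_annotate_event()
  {
    delete m_annotate_event;          // thd->query must no longer point into it
    m_annotate_event= NULL;
  }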
The saved Annotate_rows object should be deleted once all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this, suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 21 Jun '10
by worklog-noreply@askmonty.org 21 Jun '10
21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 42
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
------------------------------------------------------------
-=-=(View All Progress Notes, 35 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 21 Jun '10
by worklog-noreply@askmonty.org 21 Jun '10
21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 42
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
------------------------------------------------------------
-=-=(View All Progress Notes, 35 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write Annotate_event not more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in remote case), before requesting binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?
After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
The saved Annotate_rows object should be deleted when all corresponding
Rows events will be processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers not to have possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Progress (by Knielsen): Store in binlog text of statements that caused RBR events (47)
by worklog-noreply@askmonty.org 21 Jun '10
by worklog-noreply@askmonty.org 21 Jun '10
21 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Knielsen, Serg
CATEGORY.......: Server-Sprint
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Code-Review
PRIORITY.......: 60
WORKED HOURS...: 42
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 35
PROGRESS NOTES:
-=-=(Knielsen - Mon, 21 Jun 2010, 08:32)=-=-
Final review.
Assist with some problems applying the patch.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).
-=-=(Guest - Thu, 17 Jun 2010, 00:38)=-=-
Dependency deleted: 39 no longer depends on 47
-=-=(Knielsen - Mon, 07 Jun 2010, 07:13)=-=-
Help debug some test failures seen in Buildbot.
Worked 6 hours and estimate 0 hours remain (original estimate increased by 6 hours).
-=-=(Knielsen - Mon, 31 May 2010, 06:49)=-=-
Help Alexi debug+fix some test problems in the patch.
Worked 4 hours and estimate 0 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 25 May 2010, 08:29)=-=-
Help debug strange problem in mysqlbinlog.test.
Worked 1 hour and estimate 4 hours remain (original estimate unchanged).
-=-=(Knielsen - Mon, 17 May 2010, 08:45)=-=-
Merge with latest trunk and run Buildbot tests.
Worked 1 hour and estimate 5 hours remain (original estimate unchanged).
-=-=(Knielsen - Wed, 05 May 2010, 13:53)=-=-
Review of fixes to first review done. No new issues found.
Worked 2 hours and estimate 6 hours remain (original estimate unchanged).
-=-=(Knielsen - Fri, 23 Apr 2010, 12:51)=-=-
Status updated.
--- /tmp/wklog.47.old.28747 2010-04-23 12:51:36.000000000 +0000
+++ /tmp/wklog.47.new.28747 2010-04-23 12:51:36.000000000 +0000
@@ -1 +1 @@
-In-Progress
+Code-Review
-=-=(Knielsen - Tue, 06 Apr 2010, 15:26)=-=-
Code review (mailed to maria-developers@).
Worked 7 hours and estimate 8 hours remain (original estimate unchanged).
-=-=(Knielsen - Tue, 06 Apr 2010, 15:25)=-=-
Status updated.
--- /tmp/wklog.47.old.12734 2010-04-06 15:25:54.000000000 +0000
+++ /tmp/wklog.47.new.12734 2010-04-06 15:25:54.000000000 +0000
@@ -1 +1 @@
-Code-Review
+In-Progress
------------------------------------------------------------
-=-=(View All Progress Notes, 35 total)=-=-
http://askmonty.org/worklog/index.pl?tid=47&nolimit=1
DESCRIPTION:
Store in binlog (and show in mysqlbinlog output) texts of statements that
caused RBR events
This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.
The cost of this can be that the binlog will be approximately 2x in size
(especially insert of big blob's would be a bit painful), so this should
be an optional feature.
HIGH-LEVEL SPECIFICATION:
Content
~~~~~~~
1. Annotate_rows_log_event
2. Server option: --binlog-annotate-rows-events
3. Server option: --replicate-annotate-rows-events
4. mysqlbinlog option: --print-annotate-rows-events
5. mysqlbinlog output
1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:
************************
ANNOTATE_ROWS_EVENT
************************
00000220 | B6 A0 2C 4B | time_when = 1261215926
00000224 | 33 | event_type = 51
00000225 | 64 00 00 00 | server_id = 100
00000229 | 36 00 00 00 | event_len = 54
0000022D | 56 02 00 00 | log_pos = 00000256
00000231 | 00 00 | flags = <none>
------------------------
00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
00000237 | 52 54 20 49 |
0000023B | 4E 54 4F 20 |
0000023F | 74 31 20 56 |
00000243 | 41 4C 55 45 |
00000247 | 53 20 28 31 |
0000024B | 29 2C 20 28 |
0000024F | 32 29 2C 20 |
00000253 | 28 33 29 |
************************
In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)
2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.
* Variable Name: binlog_annotate_rows_events
* Scope: Global & Session
* Access Type: Dynamic
* Data Type: bool
* Default Value: OFF
NOTE. Session values allows to annotate only some selected statements:
...
SET SESSION binlog_annotate_rows_events=ON;
... statements to be annotated ...
SET SESSION binlog_annotate_rows_events=OFF;
... statements not to be annotated ...
3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).
* Variable Name: replicate_annotate_rows_events
* Scope: Global
* Access Type: Read only
* Data Type: bool
* Default Value: OFF
NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).
4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.
5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
...
# at 1646
#091219 12:45:26 server id 100 end_log_pos 1714 Query thread_id=1
exec_time=0 error_code=0
SET TIMESTAMP=1261215926/*!*/;
BEGIN
/*!*/;
# at 1714
# at 1812
# at 1853
# at 1894
# at 1938
#091219 12:45:26 server id 100 end_log_pos 1812 Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
#091219 12:45:26 server id 100 end_log_pos 1853 Table_map: `test`.`t1`
mapped to number 16
#091219 12:45:26 server id 100 end_log_pos 1894 Table_map: `test`.`t2`
mapped to number 17
#091219 12:45:26 server id 100 end_log_pos 1938 Delete_rows: table id 16
#091219 12:45:26 server id 100 end_log_pos 1982 Delete_rows: table id 17
flags: STMT_END_F
...
LOW-LEVEL DESIGN:
Content
~~~~~~~
1. Annotate_rows event number
2. Outline of Annotate_rows event behavior
3. How Master writes Annotate_rows events to the binary log
4. How slave treats replicate-annotate-rows-events option
5. How slave IO thread requests Annotate_rows events
6. How master executes the request
7. How slave SQL thread processes Annotate_rows events
8. General remarks
1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:
enum Log_event_type
{ ...
INCIDENT_EVENT= 26,
// New MySQL event numbers are to be added here
MYSQL_EVENTS_END,
MARIA_EVENTS_BEGIN= 51,
// New Maria event numbers start from here
ANNOTATE_ROWS_EVENT= 51,
ENUM_END_EVENT
};
together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:
************************
FORMAT_DESCRIPTION_EVENT
************************
00000004 | A1 A0 2C 4B | time_when = 1261215905
00000008 | 0F | event_type = 15
00000009 | 64 00 00 00 | server_id = 100
0000000D | 7F 00 00 00 | event_len = 127
00000011 | 83 00 00 00 | log_pos = 00000083
00000015 | 01 00 | flags = LOG_EVENT_BINLOG_IN_USE_F
------------------------
00000017 | 04 00 | binlog_ver = 4
00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
..... ...
0000004B | A1 A0 2C 4B | time_created = 1261215905
0000004F | 13 | common_header_len = 19
------------------------
post_header_len
------------------------
00000050 | 38 | 56 - START_EVENT_V3 [1]
..... ...
00000069 | 02 | 2 - INCIDENT_EVENT [26]
0000006A | 00 | 0 - RESERVED [27]
..... ...
00000081 | 00 | 0 - RESERVED [50]
00000082 | 00 | 0 - ANNOTATE_ROWS_EVENT [51]
************************
2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:
char *m_query_txt;
uint m_query_len;
When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):
Annotate_rows_log_event(THD *thd)
{
m_query_txt = thd->query();
m_query_len = thd->query_length();
}
When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):
Annotate_rows_log_event(char *buf, uint event_len,
Format_description_log_event *desc)
{
m_query_len = event_len - desc->common_header_len;
m_query_txt = buf + desc->common_header_len;
}
The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:
bool write_data_body(IO_CACHE *file)
{
return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
}
Printing the event is just printing the query:
void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
{
my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
}
3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):
int write_locked_table_maps(THD *thd)
{ ...
bool with_annotate= thd->variables.binlog_annotate_rows_events;
...
for (uint i= 0; i < ... <number of tables> ...; ++i)
{ ...
thd->binlog_write_table_map(table, ..., with_annotate);
with_annotate= 0; // write the Annotate_rows event no more than once
...
}
...
}
int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
{ ...
Table_map_log_event the_event(...);
...
if (with_annotate)
{
Annotate_rows_log_event anno(this);
mysql_bin_log.write(&anno);
}
mysql_bin_log.write(&the_event);
...
}
4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:
pthread_handler_t handle_slave_io(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_IO);
...
}
pthread_handler_t handle_slave_sql(void *arg)
{
THD *thd= new THD;
...
init_slave_thread(thd, SLAVE_THD_SQL);
...
}
int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
{ ...
thd->variables.binlog_annotate_rows_events=
opt_replicate_annotate_rows_events;
...
}
5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the replicate-annotate-rows-events option is not set on a slave, there
is no need for the master to send Annotate_rows events to this slave. The slave
(or mysqlbinlog in the remote case), before requesting a binlog dump via the
COM_BINLOG_DUMP command, informs the master whether it should send these
events by executing the newly added COM_BINLOG_DUMP_OPTIONS_EXT server
command:
case COM_BINLOG_DUMP_OPTIONS_EXT:
thd->binlog_dump_flags_ext= packet[0];
my_ok(thd);
break;
Note. We add this new command and don't use COM_BINLOG_DUMP to avoid possible
conflicts with MySQL/Sun.
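For completeness, a sketch of the slave side of this exchange - how the flag
might be sent just before COM_BINLOG_DUMP. The helper name
request_annotate_rows_events() and the use of simple_command() are
assumptions for illustration, not the actual patch:
// hypothetical helper, called from the IO thread before requesting the dump
static int request_annotate_rows_events(MYSQL *mysql, THD *thd)
{
  uchar buf[1];
  buf[0]= thd->variables.binlog_annotate_rows_events ?
          BINLOG_SEND_ANNOTATE_ROWS_EVENT : 0;
  return simple_command(mysql, COM_BINLOG_DUMP_OPTIONS_EXT, buf, 1, 0);
}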
6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
case COM_BINLOG_DUMP:
{ ...
flags= uint2korr(packet + 4);
...
mysql_binlog_send(thd, ..., flags);
...
}
void mysql_binlog_send(THD* thd, ..., ushort flags)
{ ...
Log_event::read_log_event(&log, packet, ...);
...
if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT)
{
my_net_write(net, packet->ptr(), packet->length());
}
...
}
7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each received event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev = next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
int apply_event_and_update_pos(Log_event *ev, ...)
{ ...
ev->apply_event(...);
...
}
int Log_event::apply_event(...)
{
return do_apply_event(...);
}
What does it mean to "apply" an Annotate_rows event? It means to set the
current thd query to the one described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" above to follow what happens when the subsequent
Rows events are applied):
int Annotate_rows_log_event::do_apply_event(...)
{
thd->set_query(m_query_txt, m_query_len);
}
NOTE. I am not sure, but possibly the current values of thd->query and
thd->query_length should be saved before calling set_query() and restored
when the Annotate_rows_log_event object is deleted.
Is this really needed?
After calling this do_apply_event() function we must not delete the
Annotate_rows_log_event object immediately (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:
class Relay_log_info
{
public:
...
void set_annotate_event(Annotate_rows_log_event*);
Annotate_rows_log_event* get_annotate_event();
void free_annotate_event();
...
private:
Annotate_rows_log_event* m_annotate_event;
};
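A minimal sketch of how these accessors might be implemented, assuming
m_annotate_event is initialized to NULL in the Relay_log_info constructor:
void Relay_log_info::set_annotate_event(Annotate_rows_log_event *event)
{
  free_annotate_event();       // drop a previously saved annotation, if any
  m_annotate_event= event;
}
Annotate_rows_log_event* Relay_log_info::get_annotate_event()
{
  return m_annotate_event;
}
void Relay_log_info::free_annotate_event()
{
  if (m_annotate_event)
  {
    delete m_annotate_event;
    m_annotate_event= NULL;
  }
}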
The saved Annotate_rows object should be deleted when all corresponding
Rows events have been processed:
int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{ ...
Log_event *ev= next_event(rli);
...
apply_event_and_update_pos(ev, ...);
if (rli->get_annotate_event() && is_last_rows_event(ev))
rli->free_annotate_event();
else if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
rli->set_annotate_event((Annotate_rows_log_event*) ev);
else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
delete ev;
...
}
where
bool is_last_rows_event(Log_event* ev)
{
Log_event_type type= ev->get_type_code();
if (IS_ROWS_EVENT_TYPE(type))
{
Rows_log_event* rows= (Rows_log_event*)ev;
return rows->get_flags(Rows_log_event::STMT_END_F);
}
return 0;
}
#define IS_ROWS_EVENT_TYPE(type) ((type) == WRITE_ROWS_EVENT || \
(type) == UPDATE_ROWS_EVENT || \
(type) == DELETE_ROWS_EVENT)
8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing a new log event type should be coordinated
somehow with MySQL/Sun:
Kristian: The numeric code for this event must be assigned carefully.
It should be coordinated with MySQL/Sun, otherwise we can get into a
situation where MySQL uses the same numeric code for one event that
MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
impossible.
Alex: I reserved about 20 numbers to avoid possible conflicts
with MySQL.
Kristian: Still, I think it would be appropriate to send a polite email
to internals(a)lists.mysql.com about this and suggesting to reserve the
event number.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
19 Jun '10
Sergei (and everyone else),
The Release Notes and Changelog pages for the MariaDB 5.2.1-beta release
are ready:
http://askmonty.org/wiki/Manual:MariaDB_5.2.1_Release_Notes
http://askmonty.org/wiki/Manual:MariaDB_5.2.1_Changelog
Please let me know if the Release Notes should mention anything else
or if there is anything on that page which should be changed. The
Changelog should have the full list of commits from the 5.2.0-beta up
through the commit with the 5.2.1-beta tag.
The download page for this release is also ready to go, but I haven't
activated it yet. I will activate it (i.e. link to it from the download
page, and other wiki pages) once the mirrors have been seeded (later
tonight or tomorrow).
Thanks.
--
Daniel Bartholomew
Monty Program - http://askmonty.org
1
1
[Maria-developers] Updated (by Guest): Add a mysqlbinlog option to change the used database (36)
by worklog-noreply@askmonty.org 18 Jun '10
by worklog-noreply@askmonty.org 18 Jun '10
18 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add a mysqlbinlog option to change the used database
CREATION DATE..: Fri, 07 Aug 2009, 14:57
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 36 (http://askmonty.org/worklog/?tid=36)
VERSION........: Server-5.3
STATUS.........: Complete
PRIORITY.......: 60
WORKED HOURS...: 49
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Guest - Fri, 18 Jun 2010, 15:20)=-=-
Version updated.
--- /tmp/wklog.36.old.11335 2010-06-18 15:20:26.000000000 +0000
+++ /tmp/wklog.36.new.11335 2010-06-18 15:20:26.000000000 +0000
@@ -1 +1 @@
-Server-9.x
+Server-5.3
-=-=(Guest - Thu, 17 Jun 2010, 00:39)=-=-
Dependency deleted: 39 no longer depends on 36
-=-=(Guest - Sat, 07 Nov 2009, 22:43)=-=-
Category updated.
--- /tmp/wklog.36.old.9112 2009-11-07 22:43:50.000000000 +0200
+++ /tmp/wklog.36.new.9112 2009-11-07 22:43:50.000000000 +0200
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint
-=-=(Guest - Sat, 07 Nov 2009, 22:43)=-=-
Status updated.
--- /tmp/wklog.36.old.9112 2009-11-07 22:43:50.000000000 +0200
+++ /tmp/wklog.36.new.9112 2009-11-07 22:43:50.000000000 +0200
@@ -1 +1 @@
-Un-Assigned
+Complete
-=-=(Bothorsen - Tue, 03 Nov 2009, 13:49)=-=-
More cleanup work done by Alexi, Bo and Sergey.
Worked 4 hours and estimate 0 hours remain (original estimate increased by 4 hours).
-=-=(Bothorsen - Tue, 03 Nov 2009, 13:49)=-=-
Sergey and Bo have been working on getting the patch ready, and Alexi has fixed some issues with the
patch.
Worked 15 hours and estimate 0 hours remain (original estimate increased by 15 hours).
-=-=(Bothorsen - Tue, 03 Nov 2009, 13:47)=-=-
Alexi has implemented a patch for this item.
Worked 30 hours and estimate 0 hours remain (original estimate increased by 30 hours).
-=-=(Guest - Tue, 15 Sep 2009, 18:04)=-=-
Low Level Design modified.
--- /tmp/wklog.36.old.19322 2009-09-15 18:04:49.000000000 +0300
+++ /tmp/wklog.36.new.19322 2009-09-15 18:04:49.000000000 +0300
@@ -191,7 +191,7 @@
- In process_event() function add switch case for Load_log_event and
add print_use_stmt() invocations where needed (according to the
- events lis above), e.g.:
+ events list above), e.g.:
Exit_status process_event(
PRINT_EVENT_INFO *print_event_info,
-=-=(Guest - Tue, 15 Sep 2009, 15:53)=-=-
Low Level Design modified.
--- /tmp/wklog.36.old.13421 2009-09-15 15:53:31.000000000 +0300
+++ /tmp/wklog.36.new.13421 2009-09-15 15:53:31.000000000 +0300
@@ -150,10 +150,17 @@
following events (see process_event() function):
- Query_log_event
-- Execute_load_query_log_event
-- Create_file_log_event
-
-TODO. Needed to check this list requires carefully !!!
+- Load_log_event
+- Execute_load_query_log_event [ :public Query_log_event ]
+- Create_file_log_event [ :public Load_log_event ]
+
+TODO. Needed to check this list carefully (not sure for Create_file_log_event)
+ Notes.
+ - In replication, only Query_log_event and Load_log_event uses
+ rpl_filter->get_rewrite_db();
+ - In mysqlbinlog (process_event), Execute_load_query_log_event
+ and Create_file_log_event are processed in separate switch
+ cases. And Load_log_event is processed in the default switch case.
Conditions for emiting use-statement:
- LOG_EVENT_SUPPRESS_USE_F is OFF for the event
@@ -182,8 +189,9 @@
*/
}
-- In process_event() function add print_use_stmt() invocations where
- needed (according to the events lis above), e.g.:
+- In process_event() function add switch case for Load_log_event and
+ add print_use_stmt() invocations where needed (according to the
+ events lis above), e.g.:
Exit_status process_event(
PRINT_EVENT_INFO *print_event_info,
@@ -207,6 +215,11 @@
}
break;
...
+ case LOAD_EVENT:
+ print_use_stmt((Load_log_event*)ev, print_event_info);
+ break;
+ default:
+ ...
}
...
}
-=-=(Guest - Tue, 15 Sep 2009, 12:12)=-=-
Low Level Design modified.
--- /tmp/wklog.36.old.3961 2009-09-15 12:12:26.000000000 +0300
+++ /tmp/wklog.36.new.3961 2009-09-15 12:12:26.000000000 +0300
@@ -144,6 +144,8 @@
3. Supporting rewrite-db for SBR events
---------------------------------------
+Limited to emiting USE <db_to> instead of USE <db_from>.
+
USE statements can be emited by mysqlbinlog as a result of processing the
following events (see process_event() function):
------------------------------------------------------------
-=-=(View All Progress Notes, 20 total)=-=-
http://askmonty.org/worklog/index.pl?tid=36&nolimit=1
DESCRIPTION:
Sometimes there is a need to take a binary log and apply it to a database with
a different name than the original name of the database on the binlog producer.
If one is using statement-based replication, this can be achieved by grepping
"USE dbname" statements out of the output of mysqlbinlog(*). With
row-based replication this is no longer possible, as the database name is encoded
within the BINLOG '....' statement.
This task is about adding an option to mysqlbinlog that would allow changing
the names of the used databases in both RBR and SBR events.
(*) this implies that all statements refer to tables in the current database and
doesn't catch updates made inside stored functions and so forth, but it still
works for a practically important subset of cases.
HIGH-LEVEL SPECIFICATION:
Context
-------
(See http://askmonty.org/wiki/index.php/Scratch/ReplicationOptions for global
overview)
At the moment, the server has a replication slave option
--replicate-rewrite-db="from->to"
the option affects
- Table_map_log_event (all RBR events)
- Load_log_event (LOAD DATA)
- Query_log_event (SBR-based updates, with the usual assumption that the
statement refers to tables in the current database, so that changing the current
database will make the statement work on a table in a different database).
See also MySQL BUG#42941. Note this bug is fixed in MySQL 5.1.37, which is not
merged into MariaDB at the time of writing, but planned to be merged before
release.
What we could do
----------------
Option1: make mysqlbinlog accept --replicate-rewrite-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Make mysqlbinlog accept --replicate-rewrite-db options and process them to the
same extent as a replication slave would process the --replicate-rewrite-db option.
Option2: Add database-agnostic RBR events and --strip-db option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Right now RBR events require a database name. It is not possible to have an RBR
event stream that won't mention which database the events are for. When I
tried to use a debugger and specify an empty database name, the attempt to apply
the binlog resulted in this error:
090809 17:38:44 [ERROR] Slave SQL: Error 'Table '.tablename' doesn't exist' on
opening tables,
We could do as follows:
- Make the server interpret empty database name in RBR event (i.e. in a
Table_map_log_event) as "use current database". Binlog slave thread
probably should not allow such events as it doesn't have a natural current
database.
- Add a mysqlbinlog --strip-db option that would
= not produce any "USE dbname" statements
= change the database name for all RBR events to be empty
That way, mysqlbinlog output will be database-agnostic and apply to the
current database.
(this will have the usual limitations that we assume that all statements in
the binlog refer to the current database).
Option3: Enhance database rewrite
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If there is a need to support database change for statements that use
dbname.tablename notation and are replicated as statements (i.e. are DDL
statements and/or DML statements that are binlogged as statements),
then that could be supported as follows:
- Make the server's parser recognize special form of comments
/* !database-alias(oldname,newname) */
and save the mapping somewhere
- Put the hooks in table open and name resolution code to use the saved
mapping.
Once we've done the above, it will be easy to perform a complete database
name change in the binary log, with no compromises or restrictions.
It will be possible to do the rewrites either on the slave (
--replicate-rewrite-db will work for all kinds of statements), or in
mysqlbinlog (adding a comment is easy and doesn't require mysqlbinlog to
parse the statement).
LOW-LEVEL DESIGN:
Content
-------
1. Adding rewrite-db option
2. Supporting rewrite-db option for RBR events
3. Supporting rewrite-db option for SBR events
(Limited to affecting only USE statements)
4. Current status
1. Adding rewrite-db option
---------------------------
1.1. Syntax:
--rewrite-db='db_from->db_to'
1.2. Add 'OPT_REWRITE_DB' to 'options_client' (in client_priv.h).
1.3. In mysqlbinlog.cc:
- Add { "rewrite-db", OPT_REWRITE_DB, ...} record to my_long_options:
- Add Rpl_filter object to mysqlbinlog.cc
Rpl_filter* binlog_filter;
- Add corresponding switch case to get_one_option():
case OPT_REWRITE_DB:
<extract db-from and db-to strings>
binlog_filter->add_db_rewrite(db_from, db_to);
break;
.
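One possible expansion of the '<extract db-from and db-to strings>' step
above - a sketch only; the exact argument parsing and error handling are
assumptions:
case OPT_REWRITE_DB:
{
  char *from= argument;
  char *to= strstr(argument, "->");
  if (!to || to == argument || !to[2])
  {
    fprintf(stderr, "Bad syntax in rewrite-db: expected 'db_from->db_to'\n");
    return 1;
  }
  *to= 0;                 // terminate db_from
  to+= 2;                 // skip the "->" separator
  binlog_filter->add_db_rewrite(from, to);
  break;
}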
Note. To make Rpl_filter usable in a MYSQL_CLIENT context, a few small
additional changes are required:
- In sql_list.cc/h, Sql_alloc::new(size_t) and Sql_alloc::new[](size_t)
use sql_alloc(), which is THD-dependent. These are to be modified
as follows:
#ifdef MYSQL_CLIENT
extern MEM_ROOT sql_list_client_mem_root; // defined in sql_list.cc
#endif
class Sql_alloc
{ ...
static void *operator new(size_t size) throw ()
{
#ifndef MYSQL_CLIENT
return sql_alloc(size);
#else
return alloc_root(&sql_list_client_mem_root, size);
#endif
}
static void *operator new[](size_t size) throw ()
{
#ifndef MYSQL_CLIENT
return sql_alloc(size);
#else
return alloc_root(&sql_list_client_mem_root, size);
#endif
}
...
}
- In rpl_filter.cc:
Rpl_filter::Rpl_filter() :
...
{
#ifdef MYSQL_CLIENT
init_alloc_root(&sql_list_client_mem_root, ...);
#endif
...
}
Rpl_filter::~Rpl_filter()
{ ...
#ifdef MYSQL_CLIENT
free_root(&sql_list_client_mem_root, ...);
#endif
}
2. Supporting rewrite-db for RBR events
---------------------------------------
In the binlog, each row operation event is preceded by Table map event(s) which map
table id(s) to database and table names. So, it's enough to support rewriting
the database name in a Table map.
2.1. Add rewrite_db() member to Table_map_log_event:
int Table_map_log_event::rewrite_db(
const char* new_db,
size_t new_db_len,
const Format_description_log_event* desc)
{
/* 1. In temp_buf member (possibly reallocating it) rewrite
event length, db length, and db parts
2. Change m_dblen and m_dbnam members
*/
}
Comment. This function assumes that the temp_buf member contains the Table map's
binlog representation (temp_buf is used for creating the corresponding
BINLOG statement).
2.2. In mysqlbinlog modify corresponding switch case in the
process_event() function:
Exit_status process_event(
PRINT_EVENT_INFO *print_event_info,
Log_event *ev, ...)
{
...
switch (ev_type) {
...
case TABLE_MAP_EVENT:
{
Table_map_log_event *map= ((Table_map_log_event *)ev);
if (shall_skip_database(map->get_db_name()))
{ ...
}
// WL36
size_t new_len= 0;
const char* new_db= binlog_filter->get_rewrite_db(
map->get_db_name(), &new_len);
if (new_len && map->rewrite_db(new_db, new_len,
glob_description_event))
{ error("Could not rewrite database name");
goto err;
}
}
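// note: no break here - control falls through to the rows-event cases
// below, so the (possibly rewritten) Table map is handled together with
// its Rows events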
case WRITE_ROWS_EVENT:
case DELETE_ROWS_EVENT:
case UPDATE_ROWS_EVENT:
...
}
...
}
Comment. Rpl_filter::get_rewrite_db(db_from, &len): if the filter contains
a (db_from, db_to) pair, this function returns a pointer to db_to and
sets len to the length of db_to; otherwise, it returns db_from and does not
change the len value.
3. Supporting rewrite-db for SBR events
---------------------------------------
Limited to emitting USE <db_to> instead of USE <db_from>.
USE statements can be emitted by mysqlbinlog as a result of processing the
following events (see process_event() function):
- Query_log_event
- Load_log_event
- Execute_load_query_log_event [ :public Query_log_event ]
- Create_file_log_event [ :public Load_log_event ]
TODO. This list needs to be checked carefully (not sure about Create_file_log_event).
Notes.
- In replication, only Query_log_event and Load_log_event use
rpl_filter->get_rewrite_db();
- In mysqlbinlog (process_event), Execute_load_query_log_event
and Create_file_log_event are processed in separate switch
cases, and Load_log_event is processed in the default switch case.
Conditions for emitting a use-statement:
- LOG_EVENT_SUPPRESS_USE_F is OFF for the event
(e.g. it is ON for the 'create database' statement)
- the event's db name differs from db_name in PRINT_EVENT_INFO
(PRINT_EVENT_INFO keeps the db name of the last issued USE statement;
initially, this db name is empty).
3.1. In mysqlbinlog.cc
- Add the following function:
void print_use_stmt(Log_event* event, PRINT_EVENT_INFO* pinfo)
{
if (event->flags & LOG_EVENT_SUPPRESS_USE_F)
return;
/*
- For events listed above get db_from = event->db;
- If db_from is the same as pinfo->db then return;
- If there is rewrite-db rule db_from->db_to,
set db = db_to. Else set db = db_from;
- Print "use <db>" to mysqlbinlog output
- Set pinfo->db = db_from
(this suppresses emitting use-statements by the corresponding
log_event's print-function)
*/
}
- In process_event() function add switch case for Load_log_event and
add print_use_stmt() invocations where needed (according to the
events list above), e.g.:
Exit_status process_event(
PRINT_EVENT_INFO *print_event_info,
Log_event *ev, ...)
{
...
switch (ev_type) {
case QUERY_EVENT:
if (shall_skip_database(((Query_log_event*)ev)->db))
goto end;
if (opt_base64_output_mode == BASE64_OUTPUT_ALWAYS)
{
// Possibly, in case of a rewrite-db rule for ev->db,
// a warning should be emitted here (see note below)
... write_event_header_and_base64(ev, ...) ...
}
else
{
print_use_stmt((Query_log_event*)ev, print_event_info);
ev->print(result_file, print_event_info);
}
break;
...
case LOAD_EVENT:
print_use_stmt((Load_log_event*)ev, print_event_info);
break;
default:
...
}
...
}
Note. write_event_header_and_base64() does not print a use-statement. It
produces a BINLOG statement using the ev->temp_buf content (i.e. the binary
log representation of the event). We don't rewrite temp_buf here with the
db_to name (as we do for the Table map event) - this implies the
limitation mentioned in section 3 above.
Question: Is supporting rewrite-db together with --base64-output really needed
currently?
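For illustration, the comment outline of print_use_stmt() above could be
fleshed out roughly as follows. This is a sketch under assumptions: that the
events involved expose their database via get_db(), and that pinfo->db,
pinfo->delimiter and result_file are the members/globals used elsewhere in
mysqlbinlog; it is not the actual patch:
void print_use_stmt(Log_event* event, PRINT_EVENT_INFO* pinfo)
{
  if (event->flags & LOG_EVENT_SUPPRESS_USE_F)
    return;
  const char *db_from= event->get_db();        // assumption, see above
  if (!db_from || !strcmp(db_from, pinfo->db))
    return;                                    // same db as the last USE
  size_t len= 0;                               // unused here
  const char *db= binlog_filter->get_rewrite_db(db_from, &len);
  fprintf(result_file, "use %s%s\n", db, pinfo->delimiter);
  // remember the original name so the event's own print() suppresses its USE
  strmov(pinfo->db, db_from);
}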
4. Current status
-----------------
The outlined design (implemented for mysql-5.1.37) has been tested with
simple test cases.
TODO. 1. Check the list of events which can emit a use-statement.
2. Supporting rewrite-db + --base64-output ?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
Hi everyone,
I'm currently working on a Windows installer for MariaDB, and I have two
options for you to consider. This mail covers the first of them.
The first and currently biggest contender is CPack + NSIS. This
combination has two very big things going for it: It's the same that
MySQL uses, and it integrates really well with the CMake system. In
fact, all you have to do with this solution is to install NSIS on your
system and run "cpack.exe" in a directory where you already built the
solution.
NSIS creates a single binary exe file that installs in C:\Program
Files\MariaDB-5.1.47 (for example).
NSIS is very limited in what you can actually do with the system. For
example, there is no support in there for asking the user whether they want to
delete the database files; the files just vanish. This is potentially
*extremely* bad. However, I have a theory on how to work around this
particular problem, by hacking the nsis.cmake file.
NSIS does not support upgrading of packages. Instead, it does "upgrades"
by allowing packages with different versions to install next to each
other. So if you installed the 5.1.47 version and want to upgrade to
5.1.49, you simply install 5.1.49 and copy your database files over (or,
even better, use database files in a different directory). When you are
ready, you can remove the 5.1.47 package.
This clearly has some advantages, but it's just not the way most
software updates run. When you update to a newer version of most Windows
software, and certainly on all systems using apt or RPM,
you just replace the old version with the new one.
There is no support for setting up the database in the installer, or
setting up MariaDB as a service. CMake+NSIS is just a dumb file copy
system. MySQL works around this by running another executable at the end
of the install process and this program does the setup. IMHO, that's a
very good solution, and it also allows the user to run the setup program
again later. But it's still a workaround due to the limited installer
system.
NSIS would be my choice for an installer right now. But because of the
limitations, I'd consider this a temporary solution until we have a
better one. See my next mail for a better but much more complex system.
Comments, please.
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
2
3
Hi again,
The other contender for the MariaDB installer system in 2010 is CPack +
WiX. This is a much more powerful solution, but also one that will take
a lot longer to implement.
CPack doesn't actually support WiX yet, but there is a patch out there
to implement the support. This patch is so simple, I don't understand
why they didn't just add it already. All it does is copy the built files
into a directory structure, and call the WiX binaries. It doesn't output
a specification file for the installer, like the CPack NSIS integration
does.
Instead, the implementor has to supply a .xsl file which the WiX
binaries take as input for creating a .xml file, which another WiX
binary uses to build the package.
The actual package is a single .msi file which runs like any other
graphical Windows installer.
Without CPack, the implementor writes the .xml file by hand. The CPack
integration makes it simpler to identify the files that will be
installed. If the implementor writes the .xml file manually, we always have to
keep the CMake-built files and the WiX spec in sync. So even
though the CPack integration is really small, it does make sense.
WiX is capable of very powerful installers that would work exactly the way
I'd hope. This means seamless upgrading, user account
creation (for setting MariaDB up as a service), service installation
etc. These are all things that NSIS just can't do directly, where we'd
be trying to bend the system to support what we want.
The downside of using WiX is that it's going to take a lot longer to
implement a good installer than it is to implement a simple installer
with NSIS. I already have a patch for a complete installer with NSIS,
albeit one that doesn't ask about deleting database files or have the
ability to set it up as a service. Getting to this point with WiX is not
that easy.
I'm convinced that once the WiX installer is done, it's going to be easy
to maintain it. Probably as easy as maintaining the NSIS system. And
implementing features in the installer will be a lot simpler with WiX,
because the system is designed to be powerful.
I would like to hear some discussion about this. Should I start spending
the longer time on this, or go with the simple NSIS solution for now?
Bo Thorsen.
Monty Program AB.
--
MariaDB: MySQL replacement
Community developed. Feature enhanced. Backward compatible.
3
2
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2868: Fixed compiler warnings
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2868
committer: Michael Widenius <monty(a)askmonty.org>
branch nick: maria-5.1
timestamp: Wed 2010-06-16 01:00:51 +0300
message:
Fixed compiler warnings
modified:
sql/log_event.cc
storage/maria/ma_state.c
storage/maria/maria_chk.c
storage/myisam/mi_dynrec.c
support-files/compiler_warnings.supp
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2867: Don't flush pinned pages in checkpoint (fix for my last push)
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2867
committer: Michael Widenius <monty(a)askmonty.org>
branch nick: maria-5.1
timestamp: Wed 2010-06-16 00:39:28 +0300
message:
Don't flush pinned pages in checkpoint (fix for my last push)
modified:
storage/maria/ma_pagecache.c
storage/maria/unittest/ma_pagecache_single.c
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2866: merged
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
Merge authors:
Bo Thorsen (bo.thorsen)
Michael Widenius (monty)
Sergei (sergii)
------------------------------------------------------------
revno: 2866 [merge]
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Mon 2010-06-14 19:05:32 +0200
message:
merged
modified:
CMakeLists.txt
client/mysqldump.c
client/mysqltest.cc
mysql-test/r/mysqldump.result
mysql-test/r/openssl_1.result
mysql-test/suite/maria/r/maria-recover.result
mysql-test/suite/maria/r/maria3.result
mysql-test/suite/maria/t/maria3.test
storage/maria/ha_maria.cc
storage/maria/ha_maria.h
storage/maria/ma_blockrec.h
storage/maria/ma_init.c
storage/maria/ma_open.c
storage/maria/ma_pagecache.c
storage/maria/ma_recovery.c
storage/maria/ma_state.c
storage/maria/ma_static.c
storage/maria/maria_def.h
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2865: mtr: when applying @opt_extra_mysqld_opt for --help,
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2865
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Mon 2010-06-14 18:57:30 +0200
message:
mtr: when applying @opt_extra_mysqld_opt for --help,
filter out --binlog-format - it makes mysqld to fail without --log-bin,
and we don't need either anyway for --help to work.
modified:
mysql-test/mysql-test-run.pl
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2864: ugly-ugly. $with_plugin_innobase was hard-coded in configure.in in
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2864
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Thu 2010-06-10 19:35:18 +0200
message:
ugly-ugly. $with_plugin_innobase was hard-coded in configure.in in
modified:
storage/xtradb/plug.in
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2863: fixed for mysql-test-run to
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2863
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Thu 2010-06-10 11:11:52 +0200
message:
fixed for mysql-test-run to
* fully support --mysqld=--plugin-load=xxxx
* uniformly support all loadable plugins, no need to hard-code
every new plugin in mtr
* autodetect MTR_VS_CONFIG on windows
removed:
mysql-test/suite/pbxt/t/udf-master.opt
mysql-test/suite/rpl/t/rpl_plugin_load-master.opt
mysql-test/suite/rpl/t/rpl_plugin_load-slave.opt
mysql-test/suite/rpl/t/rpl_udf-master.opt
mysql-test/suite/rpl/t/rpl_udf-slave.opt
mysql-test/t/fulltext_plugin-master.opt
mysql-test/t/plugin-master.opt
mysql-test/t/plugin_not_embedded-master.opt
mysql-test/t/udf-master.opt
mysql-test/t/udf_query_cache-master.opt
modified:
mysql-test/include/have_example_plugin.inc
mysql-test/include/have_simple_parser.inc
mysql-test/include/have_udf.inc
mysql-test/include/rpl_udf.inc
mysql-test/lib/My/File/Path.pm
mysql-test/lib/mtr_cases.pm
mysql-test/mysql-test-run.pl
mysql-test/r/information_schema.result
mysql-test/r/innodb_ignore_builtin.result
mysql-test/suite/pbxt/t/udf.test
mysql-test/t/bug46261-master.opt
mysql-test/t/bug46261.test
mysql-test/t/information_schema.test
mysql-test/t/innodb_ignore_builtin.test
mysql-test/t/mysqld_option_err.test
mysql-test/t/plugin.test
mysql-test/t/plugin_load-master.opt
mysql-test/t/plugin_not_embedded.test
mysql-test/t/udf.test
mysql-test/t/udf_query_cache.test
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2862: allow federated and innodb_plugin to be built
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2862
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Wed 2010-06-09 23:29:18 +0200
message:
allow federated and innodb_plugin to be built
renamed:
storage/federated/plug.in.disabled => storage/federated/plug.in
storage/innodb_plugin/plug.in.disabled => storage/innodb_plugin/plug.in
modified:
storage/federated/Makefile.am
storage/federatedx/Makefile.am
storage/federatedx/ha_federatedx.cc
storage/federatedx/plug.in
storage/xtradb/CMakeLists.txt
storage/xtradb/Makefile.am
storage/xtradb/handler/ha_innodb.cc
storage/xtradb/plug.in
storage/federated/plug.in
storage/innodb_plugin/plug.in
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2861: fix questionable UNIV_EXPECT's in the xtradb that confused old gcc.
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
------------------------------------------------------------
revno: 2861
committer: Sergei Golubchik <sergii(a)pisem.net>
branch nick: 5.1
timestamp: Wed 2010-06-09 13:53:51 +0200
message:
fix questionable UNIV_EXPECT's in the xtradb that confused old gcc.
modified:
storage/xtradb/include/rem0rec.ic
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Your team Maria developers is subscribed to branch lp:~maria-captains/maria/5.1-converting.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1-converting/+edit-subsc…
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1-converting] Rev 2860: Automerge MariaDB 5.1.47 release into main.
by noreply@launchpad.net 17 Jun '10
by noreply@launchpad.net 17 Jun '10
17 Jun '10
Merge authors:
<Dao-Gang.Qu(a)sun.com>
<Li-Bing.Song(a)sun.com>
Aleksandr Kuzminsky (akuzminsky)
Alexander Barkov <bar(a)mysql.com>
Alexander Nozdrin <alik(a)sun.com>...
Related merge proposals:
https://code.launchpad.net/~paul-mccullagh/maria/add-xtstat-util/+merge/250…
proposed by: Paul McCullagh (paul-mccullagh)
https://code.launchpad.net/~paul-mccullagh/maria/pbxt-1.0.11/+merge/24882
proposed by: Paul McCullagh (paul-mccullagh)
------------------------------------------------------------
revno: 2860 [merge]
committer: knielsen(a)knielsen-hq.org
branch nick: mariadb-5.1
timestamp: Mon 2010-05-31 10:43:34 +0200
message:
Automerge MariaDB 5.1.47 release into main.
removed:
mysql-test/include/ctype_innodb_like.inc
mysql-test/include/have_innodb.inc
mysql-test/include/innodb_trx_weight.inc
mysql-test/r/innodb-autoinc-44030.result
mysql-test/r/innodb-autoinc.result
mysql-test/r/innodb_bug21704.result
mysql-test/r/innodb_bug38231.result
mysql-test/r/innodb_bug40565.result
mysql-test/r/innodb_bug42101-nonzero.result
mysql-test/r/innodb_bug42101.result
mysql-test/r/innodb_bug44032.result
mysql-test/r/innodb_bug44369.result
mysql-test/r/innodb_bug45357.result
mysql-test/r/innodb_bug46000.result
mysql-test/r/innodb_bug47777.result
mysql-test/suite/innodb/include/have_innodb_plugin.inc
mysql-test/suite/innodb/include/innodb-index.inc
mysql-test/suite/innodb/r/innodb-analyze.result
mysql-test/suite/innodb/r/innodb-consistent.result
mysql-test/suite/innodb/r/innodb-index.result
mysql-test/suite/innodb/r/innodb-index_ucs2.result
mysql-test/suite/innodb/r/innodb-timeout.result
mysql-test/suite/innodb/r/innodb-use-sys-malloc.result
mysql-test/suite/innodb/r/innodb-zip.result
mysql-test/suite/innodb/r/innodb_bug36169.result
mysql-test/suite/innodb/r/innodb_bug36172.result
mysql-test/suite/innodb/r/innodb_bug40360.result
mysql-test/suite/innodb/r/innodb_bug41904.result
mysql-test/suite/innodb/r/innodb_bug44571.result
mysql-test/suite/innodb/r/innodb_bug46676.result
mysql-test/suite/innodb/r/innodb_bug47167.result
mysql-test/suite/innodb/r/innodb_information_schema.result
mysql-test/suite/innodb/t/disabled.def
mysql-test/suite/innodb/t/innodb-analyze.test
mysql-test/suite/innodb/t/innodb-consistent-master.opt
mysql-test/suite/innodb/t/innodb-consistent.test
mysql-test/suite/innodb/t/innodb-index.test
mysql-test/suite/innodb/t/innodb-index_ucs2.test
mysql-test/suite/innodb/t/innodb-timeout.test
mysql-test/suite/innodb/t/innodb-use-sys-malloc-master.opt
mysql-test/suite/innodb/t/innodb-use-sys-malloc.test
mysql-test/suite/innodb/t/innodb-zip.test
mysql-test/suite/innodb/t/innodb_bug36169.test
mysql-test/suite/innodb/t/innodb_bug36172.test
mysql-test/suite/innodb/t/innodb_bug40360.test
mysql-test/suite/innodb/t/innodb_bug41904.test
mysql-test/suite/innodb/t/innodb_bug44571.test
mysql-test/suite/innodb/t/innodb_bug46676.test
mysql-test/suite/innodb/t/innodb_bug47167.test
mysql-test/suite/innodb/t/innodb_information_schema.test
mysql-test/t/innodb-autoinc-44030.test
mysql-test/t/innodb-autoinc.test
mysql-test/t/innodb_bug21704.test
mysql-test/t/innodb_bug38231.test
mysql-test/t/innodb_bug40565.test
mysql-test/t/innodb_bug42101-nonzero-master.opt
mysql-test/t/innodb_bug42101-nonzero.test
mysql-test/t/innodb_bug42101.test
mysql-test/t/innodb_bug44032.test
mysql-test/t/innodb_bug44369.test
mysql-test/t/innodb_bug45357.test
mysql-test/t/innodb_bug46000.test
mysql-test/t/innodb_bug47777.test
storage/innobase/
storage/innobase/CMakeLists.txt
storage/innobase/Makefile.am
storage/innobase/btr/
storage/innobase/btr/btr0btr.c
storage/innobase/btr/btr0cur.c
storage/innobase/btr/btr0pcur.c
storage/innobase/btr/btr0sea.c
storage/innobase/buf/
storage/innobase/buf/buf0buf.c
storage/innobase/buf/buf0flu.c
storage/innobase/buf/buf0lru.c
storage/innobase/buf/buf0rea.c
storage/innobase/data/
storage/innobase/data/data0data.c
storage/innobase/data/data0type.c
storage/innobase/dict/
storage/innobase/dict/dict0boot.c
storage/innobase/dict/dict0crea.c
storage/innobase/dict/dict0dict.c
storage/innobase/dict/dict0load.c
storage/innobase/dict/dict0mem.c
storage/innobase/dyn/
storage/innobase/dyn/dyn0dyn.c
storage/innobase/eval/
storage/innobase/eval/eval0eval.c
storage/innobase/eval/eval0proc.c
storage/innobase/fil/
storage/innobase/fil/fil0fil.c
storage/innobase/fsp/
storage/innobase/fsp/fsp0fsp.c
storage/innobase/fut/
storage/innobase/fut/fut0fut.c
storage/innobase/fut/fut0lst.c
storage/innobase/ha/
storage/innobase/ha/ha0ha.c
storage/innobase/ha/hash0hash.c
storage/innobase/handler/
storage/innobase/handler/ha_innodb.cc
storage/innobase/handler/ha_innodb.h
storage/innobase/ibuf/
storage/innobase/ibuf/ibuf0ibuf.c
storage/innobase/include/
storage/innobase/include/btr0btr.h
storage/innobase/include/btr0btr.ic
storage/innobase/include/btr0cur.h
storage/innobase/include/btr0cur.ic
storage/innobase/include/btr0pcur.h
storage/innobase/include/btr0pcur.ic
storage/innobase/include/btr0sea.h
storage/innobase/include/btr0sea.ic
storage/innobase/include/btr0types.h
storage/innobase/include/buf0buf.h
storage/innobase/include/buf0buf.ic
storage/innobase/include/buf0flu.h
storage/innobase/include/buf0flu.ic
storage/innobase/include/buf0lru.h
storage/innobase/include/buf0lru.ic
storage/innobase/include/buf0rea.h
storage/innobase/include/buf0types.h
storage/innobase/include/data0data.h
storage/innobase/include/data0data.ic
storage/innobase/include/data0type.h
storage/innobase/include/data0type.ic
storage/innobase/include/data0types.h
storage/innobase/include/db0err.h
storage/innobase/include/dict0boot.h
storage/innobase/include/dict0boot.ic
storage/innobase/include/dict0crea.h
storage/innobase/include/dict0crea.ic
storage/innobase/include/dict0dict.h
storage/innobase/include/dict0dict.ic
storage/innobase/include/dict0load.h
storage/innobase/include/dict0load.ic
storage/innobase/include/dict0mem.h
storage/innobase/include/dict0mem.ic
storage/innobase/include/dict0types.h
storage/innobase/include/dyn0dyn.h
storage/innobase/include/dyn0dyn.ic
storage/innobase/include/eval0eval.h
storage/innobase/include/eval0eval.ic
storage/innobase/include/eval0proc.h
storage/innobase/include/eval0proc.ic
storage/innobase/include/fil0fil.h
storage/innobase/include/fsp0fsp.h
storage/innobase/include/fsp0fsp.ic
storage/innobase/include/fsp0types.h
storage/innobase/include/fut0fut.h
storage/innobase/include/fut0fut.ic
storage/innobase/include/fut0lst.h
storage/innobase/include/fut0lst.ic
storage/innobase/include/ha0ha.h
storage/innobase/include/ha0ha.ic
storage/innobase/include/ha_prototypes.h
storage/innobase/include/hash0hash.h
storage/innobase/include/hash0hash.ic
storage/innobase/include/ibuf0ibuf.h
storage/innobase/include/ibuf0ibuf.ic
storage/innobase/include/ibuf0types.h
storage/innobase/include/lock0iter.h
storage/innobase/include/lock0lock.h
storage/innobase/include/lock0lock.ic
storage/innobase/include/lock0priv.h
storage/innobase/include/lock0priv.ic
storage/innobase/include/lock0types.h
storage/innobase/include/log0log.h
storage/innobase/include/log0log.ic
storage/innobase/include/log0recv.h
storage/innobase/include/log0recv.ic
storage/innobase/include/mach0data.h
storage/innobase/include/mach0data.ic
storage/innobase/include/mem0dbg.h
storage/innobase/include/mem0dbg.ic
storage/innobase/include/mem0mem.h
storage/innobase/include/mem0mem.ic
storage/innobase/include/mem0pool.h
storage/innobase/include/mem0pool.ic
storage/innobase/include/mtr0log.h
storage/innobase/include/mtr0log.ic
storage/innobase/include/mtr0mtr.h
storage/innobase/include/mtr0mtr.ic
storage/innobase/include/mtr0types.h
storage/innobase/include/os0file.h
storage/innobase/include/os0proc.h
storage/innobase/include/os0proc.ic
storage/innobase/include/os0sync.h
storage/innobase/include/os0sync.ic
storage/innobase/include/os0thread.h
storage/innobase/include/os0thread.ic
storage/innobase/include/page0cur.h
storage/innobase/include/page0cur.ic
storage/innobase/include/page0page.h
storage/innobase/include/page0page.ic
storage/innobase/include/page0types.h
storage/innobase/include/pars0grm.h
storage/innobase/include/pars0opt.h
storage/innobase/include/pars0opt.ic
storage/innobase/include/pars0pars.h
storage/innobase/include/pars0pars.ic
storage/innobase/include/pars0sym.h
storage/innobase/include/pars0sym.ic
storage/innobase/include/pars0types.h
storage/innobase/include/que0que.h
storage/innobase/include/que0que.ic
storage/innobase/include/que0types.h
storage/innobase/include/read0read.h
storage/innobase/include/read0read.ic
storage/innobase/include/read0types.h
storage/innobase/include/rem0cmp.h
storage/innobase/include/rem0cmp.ic
storage/innobase/include/rem0rec.h
storage/innobase/include/rem0rec.ic
storage/innobase/include/rem0types.h
storage/innobase/include/row0ins.h
storage/innobase/include/row0ins.ic
storage/innobase/include/row0mysql.h
storage/innobase/include/row0mysql.ic
storage/innobase/include/row0purge.h
storage/innobase/include/row0purge.ic
storage/innobase/include/row0row.h
storage/innobase/include/row0row.ic
storage/innobase/include/row0sel.h
storage/innobase/include/row0sel.ic
storage/innobase/include/row0types.h
storage/innobase/include/row0uins.h
storage/innobase/include/row0uins.ic
storage/innobase/include/row0umod.h
storage/innobase/include/row0umod.ic
storage/innobase/include/row0undo.h
storage/innobase/include/row0undo.ic
storage/innobase/include/row0upd.h
storage/innobase/include/row0upd.ic
storage/innobase/include/row0vers.h
storage/innobase/include/row0vers.ic
storage/innobase/include/srv0que.h
storage/innobase/include/srv0srv.h
storage/innobase/include/srv0srv.ic
storage/innobase/include/srv0start.h
storage/innobase/include/sync0arr.h
storage/innobase/include/sync0arr.ic
storage/innobase/include/sync0rw.h
storage/innobase/include/sync0rw.ic
storage/innobase/include/sync0sync.h
storage/innobase/include/sync0sync.ic
storage/innobase/include/sync0types.h
storage/innobase/include/thr0loc.h
storage/innobase/include/thr0loc.ic
storage/innobase/include/trx0purge.h
storage/innobase/include/trx0purge.ic
storage/innobase/include/trx0rec.h
storage/innobase/include/trx0rec.ic
storage/innobase/include/trx0roll.h
storage/innobase/include/trx0roll.ic
storage/innobase/include/trx0rseg.h
storage/innobase/include/trx0rseg.ic
storage/innobase/include/trx0sys.h
storage/innobase/include/trx0sys.ic
storage/innobase/include/trx0trx.h
storage/innobase/include/trx0trx.ic
storage/innobase/include/trx0types.h
storage/innobase/include/trx0undo.h
storage/innobase/include/trx0undo.ic
storage/innobase/include/trx0xa.h
storage/innobase/include/univ.i
storage/innobase/include/usr0sess.h
storage/innobase/include/usr0sess.ic
storage/innobase/include/usr0types.h
storage/innobase/include/ut0byte.h
storage/innobase/include/ut0byte.ic
storage/innobase/include/ut0dbg.h
storage/innobase/include/ut0list.h
storage/innobase/include/ut0list.ic
storage/innobase/include/ut0lst.h
storage/innobase/include/ut0mem.h
storage/innobase/include/ut0mem.ic
storage/innobase/include/ut0rnd.h
storage/innobase/include/ut0rnd.ic
storage/innobase/include/ut0sort.h
storage/innobase/include/ut0ut.h
storage/innobase/include/ut0ut.ic
storage/innobase/include/ut0vec.h
storage/innobase/include/ut0vec.ic
storage/innobase/include/ut0wqueue.h
storage/innobase/lock/
storage/innobase/lock/lock0iter.c
storage/innobase/lock/lock0lock.c
storage/innobase/log/
storage/innobase/log/log0log.c
storage/innobase/log/log0recv.c
storage/innobase/mach/
storage/innobase/mach/mach0data.c
storage/innobase/mem/
storage/innobase/mem/mem0dbg.c
storage/innobase/mem/mem0mem.c
storage/innobase/mem/mem0pool.c
storage/innobase/mtr/
storage/innobase/mtr/mtr0log.c
storage/innobase/mtr/mtr0mtr.c
storage/innobase/os/
storage/innobase/os/os0file.c
storage/innobase/os/os0proc.c
storage/innobase/os/os0sync.c
storage/innobase/os/os0thread.c
storage/innobase/page/
storage/innobase/page/page0cur.c
storage/innobase/page/page0page.c
storage/innobase/pars/
storage/innobase/pars/lexyy.c
storage/innobase/pars/make_bison.sh
storage/innobase/pars/make_flex.sh
storage/innobase/pars/pars0grm.c
storage/innobase/pars/pars0grm.h
storage/innobase/pars/pars0grm.y
storage/innobase/pars/pars0lex.l
storage/innobase/pars/pars0opt.c
storage/innobase/pars/pars0pars.c
storage/innobase/pars/pars0sym.c
storage/innobase/plug.in.disabled
storage/innobase/que/
storage/innobase/que/que0que.c
storage/innobase/read/
storage/innobase/read/read0read.c
storage/innobase/rem/
storage/innobase/rem/rem0cmp.c
storage/innobase/rem/rem0rec.c
storage/innobase/row/
storage/innobase/row/row0ins.c
storage/innobase/row/row0mysql.c
storage/innobase/row/row0purge.c
storage/innobase/row/row0row.c
storage/innobase/row/row0sel.c
storage/innobase/row/row0uins.c
storage/innobase/row/row0umod.c
storage/innobase/row/row0undo.c
storage/innobase/row/row0upd.c
storage/innobase/row/row0vers.c
storage/innobase/srv/
storage/innobase/srv/srv0que.c
storage/innobase/srv/srv0srv.c
storage/innobase/srv/srv0start.c
storage/innobase/sync/
storage/innobase/sync/sync0arr.c
storage/innobase/sync/sync0rw.c
storage/innobase/sync/sync0sync.c
storage/innobase/thr/
storage/innobase/thr/thr0loc.c
storage/innobase/trx/
storage/innobase/trx/trx0purge.c
storage/innobase/trx/trx0rec.c
storage/innobase/trx/trx0roll.c
storage/innobase/trx/trx0rseg.c
storage/innobase/trx/trx0sys.c
storage/innobase/trx/trx0trx.c
storage/innobase/trx/trx0undo.c
storage/innobase/usr/
storage/innobase/usr/usr0sess.c
storage/innobase/ut/
storage/innobase/ut/ut0byte.c
storage/innobase/ut/ut0dbg.c
storage/innobase/ut/ut0list.c
storage/innobase/ut/ut0mem.c
storage/innobase/ut/ut0rnd.c
storage/innobase/ut/ut0ut.c
storage/innobase/ut/ut0vec.c
storage/innobase/ut/ut0wqueue.c
storage/innodb_plugin/
storage/innodb_plugin/CMakeLists.txt
storage/innodb_plugin/COPYING
storage/innodb_plugin/COPYING.Google
storage/innodb_plugin/COPYING.Percona
storage/innodb_plugin/COPYING.Sun_Microsystems
storage/innodb_plugin/ChangeLog
storage/innodb_plugin/Doxyfile
storage/innodb_plugin/Makefile.am
storage/innodb_plugin/btr/
storage/innodb_plugin/btr/btr0btr.c
storage/innodb_plugin/btr/btr0cur.c
storage/innodb_plugin/btr/btr0pcur.c
storage/innodb_plugin/btr/btr0sea.c
storage/innodb_plugin/buf/
storage/innodb_plugin/buf/buf0buddy.c
storage/innodb_plugin/buf/buf0buf.c
storage/innodb_plugin/buf/buf0flu.c
storage/innodb_plugin/buf/buf0lru.c
storage/innodb_plugin/buf/buf0rea.c
storage/innodb_plugin/compile-innodb
storage/innodb_plugin/compile-innodb-debug
storage/innodb_plugin/data/
storage/innodb_plugin/data/data0data.c
storage/innodb_plugin/data/data0type.c
storage/innodb_plugin/dict/
storage/innodb_plugin/dict/dict0boot.c
storage/innodb_plugin/dict/dict0crea.c
storage/innodb_plugin/dict/dict0dict.c
storage/innodb_plugin/dict/dict0load.c
storage/innodb_plugin/dict/dict0mem.c
storage/innodb_plugin/dyn/
storage/innodb_plugin/dyn/dyn0dyn.c
storage/innodb_plugin/eval/
storage/innodb_plugin/eval/eval0eval.c
storage/innodb_plugin/eval/eval0proc.c
storage/innodb_plugin/fil/
storage/innodb_plugin/fil/fil0fil.c
storage/innodb_plugin/fsp/
storage/innodb_plugin/fsp/fsp0fsp.c
storage/innodb_plugin/fut/
storage/innodb_plugin/fut/fut0fut.c
storage/innodb_plugin/fut/fut0lst.c
storage/innodb_plugin/ha/
storage/innodb_plugin/ha/ha0ha.c
storage/innodb_plugin/ha/ha0storage.c
storage/innodb_plugin/ha/hash0hash.c
storage/innodb_plugin/ha_innodb.def
storage/innodb_plugin/handler/
storage/innodb_plugin/handler/ha_innodb.cc
storage/innodb_plugin/handler/ha_innodb.h
storage/innodb_plugin/handler/handler0alter.cc
storage/innodb_plugin/handler/i_s.cc
storage/innodb_plugin/handler/i_s.h
storage/innodb_plugin/handler/mysql_addons.cc
storage/innodb_plugin/ibuf/
storage/innodb_plugin/ibuf/ibuf0ibuf.c
storage/innodb_plugin/include/
storage/innodb_plugin/include/btr0btr.h
storage/innodb_plugin/include/btr0btr.ic
storage/innodb_plugin/include/btr0cur.h
storage/innodb_plugin/include/btr0cur.ic
storage/innodb_plugin/include/btr0pcur.h
storage/innodb_plugin/include/btr0pcur.ic
storage/innodb_plugin/include/btr0sea.h
storage/innodb_plugin/include/btr0sea.ic
storage/innodb_plugin/include/btr0types.h
storage/innodb_plugin/include/buf0buddy.h
storage/innodb_plugin/include/buf0buddy.ic
storage/innodb_plugin/include/buf0buf.h
storage/innodb_plugin/include/buf0buf.ic
storage/innodb_plugin/include/buf0flu.h
storage/innodb_plugin/include/buf0flu.ic
storage/innodb_plugin/include/buf0lru.h
storage/innodb_plugin/include/buf0lru.ic
storage/innodb_plugin/include/buf0rea.h
storage/innodb_plugin/include/buf0types.h
storage/innodb_plugin/include/data0data.h
storage/innodb_plugin/include/data0data.ic
storage/innodb_plugin/include/data0type.h
storage/innodb_plugin/include/data0type.ic
storage/innodb_plugin/include/data0types.h
storage/innodb_plugin/include/db0err.h
storage/innodb_plugin/include/dict0boot.h
storage/innodb_plugin/include/dict0boot.ic
storage/innodb_plugin/include/dict0crea.h
storage/innodb_plugin/include/dict0crea.ic
storage/innodb_plugin/include/dict0dict.h
storage/innodb_plugin/include/dict0dict.ic
storage/innodb_plugin/include/dict0load.h
storage/innodb_plugin/include/dict0load.ic
storage/innodb_plugin/include/dict0mem.h
storage/innodb_plugin/include/dict0mem.ic
storage/innodb_plugin/include/dict0types.h
storage/innodb_plugin/include/dyn0dyn.h
storage/innodb_plugin/include/dyn0dyn.ic
storage/innodb_plugin/include/eval0eval.h
storage/innodb_plugin/include/eval0eval.ic
storage/innodb_plugin/include/eval0proc.h
storage/innodb_plugin/include/eval0proc.ic
storage/innodb_plugin/include/fil0fil.h
storage/innodb_plugin/include/fsp0fsp.h
storage/innodb_plugin/include/fsp0fsp.ic
storage/innodb_plugin/include/fsp0types.h
storage/innodb_plugin/include/fut0fut.h
storage/innodb_plugin/include/fut0fut.ic
storage/innodb_plugin/include/fut0lst.h
storage/innodb_plugin/include/fut0lst.ic
storage/innodb_plugin/include/ha0ha.h
storage/innodb_plugin/include/ha0ha.ic
storage/innodb_plugin/include/ha0storage.h
storage/innodb_plugin/include/ha0storage.ic
storage/innodb_plugin/include/ha_prototypes.h
storage/innodb_plugin/include/handler0alter.h
storage/innodb_plugin/include/hash0hash.h
storage/innodb_plugin/include/hash0hash.ic
storage/innodb_plugin/include/ibuf0ibuf.h
storage/innodb_plugin/include/ibuf0ibuf.ic
storage/innodb_plugin/include/ibuf0types.h
storage/innodb_plugin/include/lock0iter.h
storage/innodb_plugin/include/lock0lock.h
storage/innodb_plugin/include/lock0lock.ic
storage/innodb_plugin/include/lock0priv.h
storage/innodb_plugin/include/lock0priv.ic
storage/innodb_plugin/include/lock0types.h
storage/innodb_plugin/include/log0log.h
storage/innodb_plugin/include/log0log.ic
storage/innodb_plugin/include/log0recv.h
storage/innodb_plugin/include/log0recv.ic
storage/innodb_plugin/include/mach0data.h
storage/innodb_plugin/include/mach0data.ic
storage/innodb_plugin/include/mem0dbg.h
storage/innodb_plugin/include/mem0dbg.ic
storage/innodb_plugin/include/mem0mem.h
storage/innodb_plugin/include/mem0mem.ic
storage/innodb_plugin/include/mem0pool.h
storage/innodb_plugin/include/mem0pool.ic
storage/innodb_plugin/include/mtr0log.h
storage/innodb_plugin/include/mtr0log.ic
storage/innodb_plugin/include/mtr0mtr.h
storage/innodb_plugin/include/mtr0mtr.ic
storage/innodb_plugin/include/mtr0types.h
storage/innodb_plugin/include/mysql_addons.h
storage/innodb_plugin/include/os0file.h
storage/innodb_plugin/include/os0proc.h
storage/innodb_plugin/include/os0proc.ic
storage/innodb_plugin/include/os0sync.h
storage/innodb_plugin/include/os0sync.ic
storage/innodb_plugin/include/os0thread.h
storage/innodb_plugin/include/os0thread.ic
storage/innodb_plugin/include/page0cur.h
storage/innodb_plugin/include/page0cur.ic
storage/innodb_plugin/include/page0page.h
storage/innodb_plugin/include/page0page.ic
storage/innodb_plugin/include/page0types.h
storage/innodb_plugin/include/page0zip.h
storage/innodb_plugin/include/page0zip.ic
storage/innodb_plugin/include/pars0grm.h
storage/innodb_plugin/include/pars0opt.h
storage/innodb_plugin/include/pars0opt.ic
storage/innodb_plugin/include/pars0pars.h
storage/innodb_plugin/include/pars0pars.ic
storage/innodb_plugin/include/pars0sym.h
storage/innodb_plugin/include/pars0sym.ic
storage/innodb_plugin/include/pars0types.h
storage/innodb_plugin/include/que0que.h
storage/innodb_plugin/include/que0que.ic
storage/innodb_plugin/include/que0types.h
storage/innodb_plugin/include/read0read.h
storage/innodb_plugin/include/read0read.ic
storage/innodb_plugin/include/read0types.h
storage/innodb_plugin/include/rem0cmp.h
storage/innodb_plugin/include/rem0cmp.ic
storage/innodb_plugin/include/rem0rec.h
storage/innodb_plugin/include/rem0rec.ic
storage/innodb_plugin/include/rem0types.h
storage/innodb_plugin/include/row0ext.h
storage/innodb_plugin/include/row0ext.ic
storage/innodb_plugin/include/row0ins.h
storage/innodb_plugin/include/row0ins.ic
storage/innodb_plugin/include/row0merge.h
storage/innodb_plugin/include/row0mysql.h
storage/innodb_plugin/include/row0mysql.ic
storage/innodb_plugin/include/row0purge.h
storage/innodb_plugin/include/row0purge.ic
storage/innodb_plugin/include/row0row.h
storage/innodb_plugin/include/row0row.ic
storage/innodb_plugin/include/row0sel.h
storage/innodb_plugin/include/row0sel.ic
storage/innodb_plugin/include/row0types.h
storage/innodb_plugin/include/row0uins.h
storage/innodb_plugin/include/row0uins.ic
storage/innodb_plugin/include/row0umod.h
storage/innodb_plugin/include/row0umod.ic
storage/innodb_plugin/include/row0undo.h
storage/innodb_plugin/include/row0undo.ic
storage/innodb_plugin/include/row0upd.h
storage/innodb_plugin/include/row0upd.ic
storage/innodb_plugin/include/row0vers.h
storage/innodb_plugin/include/row0vers.ic
storage/innodb_plugin/include/srv0que.h
storage/innodb_plugin/include/srv0srv.h
storage/innodb_plugin/include/srv0srv.ic
storage/innodb_plugin/include/srv0start.h
storage/innodb_plugin/include/sync0arr.h
storage/innodb_plugin/include/sync0arr.ic
storage/innodb_plugin/include/sync0rw.h
storage/innodb_plugin/include/sync0rw.ic
storage/innodb_plugin/include/sync0sync.h
storage/innodb_plugin/include/sync0sync.ic
storage/innodb_plugin/include/sync0types.h
storage/innodb_plugin/include/thr0loc.h
storage/innodb_plugin/include/thr0loc.ic
storage/innodb_plugin/include/trx0i_s.h
storage/innodb_plugin/include/trx0purge.h
storage/innodb_plugin/include/trx0purge.ic
storage/innodb_plugin/include/trx0rec.h
storage/innodb_plugin/include/trx0rec.ic
storage/innodb_plugin/include/trx0roll.h
storage/innodb_plugin/include/trx0roll.ic
storage/innodb_plugin/include/trx0rseg.h
storage/innodb_plugin/include/trx0rseg.ic
storage/innodb_plugin/include/trx0sys.h
storage/innodb_plugin/include/trx0sys.ic
storage/innodb_plugin/include/trx0trx.h
storage/innodb_plugin/include/trx0trx.ic
storage/innodb_plugin/include/trx0types.h
storage/innodb_plugin/include/trx0undo.h
storage/innodb_plugin/include/trx0undo.ic
storage/innodb_plugin/include/trx0xa.h
storage/innodb_plugin/include/univ.i
storage/innodb_plugin/include/usr0sess.h
storage/innodb_plugin/include/usr0sess.ic
storage/innodb_plugin/include/usr0types.h
storage/innodb_plugin/include/ut0auxconf.h
storage/innodb_plugin/include/ut0byte.h
storage/innodb_plugin/include/ut0byte.ic
storage/innodb_plugin/include/ut0dbg.h
storage/innodb_plugin/include/ut0list.h
storage/innodb_plugin/include/ut0list.ic
storage/innodb_plugin/include/ut0lst.h
storage/innodb_plugin/include/ut0mem.h
storage/innodb_plugin/include/ut0mem.ic
storage/innodb_plugin/include/ut0rnd.h
storage/innodb_plugin/include/ut0rnd.ic
storage/innodb_plugin/include/ut0sort.h
storage/innodb_plugin/include/ut0ut.h
storage/innodb_plugin/include/ut0ut.ic
storage/innodb_plugin/include/ut0vec.h
storage/innodb_plugin/include/ut0vec.ic
storage/innodb_plugin/include/ut0wqueue.h
storage/innodb_plugin/lock/
storage/innodb_plugin/lock/lock0iter.c
storage/innodb_plugin/lock/lock0lock.c
storage/innodb_plugin/log/
storage/innodb_plugin/log/log0log.c
storage/innodb_plugin/log/log0recv.c
storage/innodb_plugin/mach/
storage/innodb_plugin/mach/mach0data.c
storage/innodb_plugin/mem/
storage/innodb_plugin/mem/mem0dbg.c
storage/innodb_plugin/mem/mem0mem.c
storage/innodb_plugin/mem/mem0pool.c
storage/innodb_plugin/mtr/
storage/innodb_plugin/mtr/mtr0log.c
storage/innodb_plugin/mtr/mtr0mtr.c
storage/innodb_plugin/mysql-test/
storage/innodb_plugin/mysql-test/ctype_innodb_like.inc
storage/innodb_plugin/mysql-test/have_innodb.inc
storage/innodb_plugin/mysql-test/innodb-analyze.result
storage/innodb_plugin/mysql-test/innodb-analyze.test
storage/innodb_plugin/mysql-test/innodb-autoinc.result
storage/innodb_plugin/mysql-test/innodb-autoinc.test
storage/innodb_plugin/mysql-test/innodb-consistent-master.opt
storage/innodb_plugin/mysql-test/innodb-consistent.result
storage/innodb_plugin/mysql-test/innodb-consistent.test
storage/innodb_plugin/mysql-test/innodb-index.inc
storage/innodb_plugin/mysql-test/innodb-index.result
storage/innodb_plugin/mysql-test/innodb-index.test
storage/innodb_plugin/mysql-test/innodb-index_ucs2.result
storage/innodb_plugin/mysql-test/innodb-index_ucs2.test
storage/innodb_plugin/mysql-test/innodb-lock.result
storage/innodb_plugin/mysql-test/innodb-lock.test
storage/innodb_plugin/mysql-test/innodb-master.opt
storage/innodb_plugin/mysql-test/innodb-replace.result
storage/innodb_plugin/mysql-test/innodb-replace.test
storage/innodb_plugin/mysql-test/innodb-semi-consistent-master.opt
storage/innodb_plugin/mysql-test/innodb-semi-consistent.result
storage/innodb_plugin/mysql-test/innodb-semi-consistent.test
storage/innodb_plugin/mysql-test/innodb-timeout.result
storage/innodb_plugin/mysql-test/innodb-timeout.test
storage/innodb_plugin/mysql-test/innodb-use-sys-malloc-master.opt
storage/innodb_plugin/mysql-test/innodb-use-sys-malloc.result
storage/innodb_plugin/mysql-test/innodb-use-sys-malloc.test
storage/innodb_plugin/mysql-test/innodb-zip.result
storage/innodb_plugin/mysql-test/innodb-zip.test
storage/innodb_plugin/mysql-test/innodb.result
storage/innodb_plugin/mysql-test/innodb.test
storage/innodb_plugin/mysql-test/innodb_bug21704.result
storage/innodb_plugin/mysql-test/innodb_bug21704.test
storage/innodb_plugin/mysql-test/innodb_bug34053.result
storage/innodb_plugin/mysql-test/innodb_bug34053.test
storage/innodb_plugin/mysql-test/innodb_bug34300.result
storage/innodb_plugin/mysql-test/innodb_bug34300.test
storage/innodb_plugin/mysql-test/innodb_bug35220.result
storage/innodb_plugin/mysql-test/innodb_bug35220.test
storage/innodb_plugin/mysql-test/innodb_bug36169.result
storage/innodb_plugin/mysql-test/innodb_bug36169.test
storage/innodb_plugin/mysql-test/innodb_bug36172.result
storage/innodb_plugin/mysql-test/innodb_bug36172.test
storage/innodb_plugin/mysql-test/innodb_bug40360.result
storage/innodb_plugin/mysql-test/innodb_bug40360.test
storage/innodb_plugin/mysql-test/innodb_bug40565.result
storage/innodb_plugin/mysql-test/innodb_bug40565.test
storage/innodb_plugin/mysql-test/innodb_bug41904.result
storage/innodb_plugin/mysql-test/innodb_bug41904.test
storage/innodb_plugin/mysql-test/innodb_bug42101-nonzero-master.opt
storage/innodb_plugin/mysql-test/innodb_bug42101-nonzero.result
storage/innodb_plugin/mysql-test/innodb_bug42101-nonzero.test
storage/innodb_plugin/mysql-test/innodb_bug42101.result
storage/innodb_plugin/mysql-test/innodb_bug42101.test
storage/innodb_plugin/mysql-test/innodb_bug44032.result
storage/innodb_plugin/mysql-test/innodb_bug44032.test
storage/innodb_plugin/mysql-test/innodb_bug44369.result
storage/innodb_plugin/mysql-test/innodb_bug44369.test
storage/innodb_plugin/mysql-test/innodb_bug44571.result
storage/innodb_plugin/mysql-test/innodb_bug44571.test
storage/innodb_plugin/mysql-test/innodb_bug45357.result
storage/innodb_plugin/mysql-test/innodb_bug45357.test
storage/innodb_plugin/mysql-test/innodb_bug46000.result
storage/innodb_plugin/mysql-test/innodb_bug46000.test
storage/innodb_plugin/mysql-test/innodb_file_format.result
storage/innodb_plugin/mysql-test/innodb_file_format.test
storage/innodb_plugin/mysql-test/innodb_information_schema.result
storage/innodb_plugin/mysql-test/innodb_information_schema.test
storage/innodb_plugin/mysql-test/innodb_trx_weight.inc
storage/innodb_plugin/mysql-test/innodb_trx_weight.result
storage/innodb_plugin/mysql-test/innodb_trx_weight.test
storage/innodb_plugin/mysql-test/patches/
storage/innodb_plugin/mysql-test/patches/README
storage/innodb_plugin/mysql-test/patches/index_merge_innodb-explain.diff
storage/innodb_plugin/mysql-test/patches/information_schema.diff
storage/innodb_plugin/mysql-test/patches/innodb-index.diff
storage/innodb_plugin/mysql-test/patches/innodb_file_per_table.diff
storage/innodb_plugin/mysql-test/patches/innodb_lock_wait_timeout.diff
storage/innodb_plugin/mysql-test/patches/innodb_thread_concurrency_basic.diff
storage/innodb_plugin/mysql-test/patches/partition_innodb.diff
storage/innodb_plugin/os/
storage/innodb_plugin/os/os0file.c
storage/innodb_plugin/os/os0proc.c
storage/innodb_plugin/os/os0sync.c
storage/innodb_plugin/os/os0thread.c
storage/innodb_plugin/page/
storage/innodb_plugin/page/page0cur.c
storage/innodb_plugin/page/page0page.c
storage/innodb_plugin/page/page0zip.c
storage/innodb_plugin/pars/
storage/innodb_plugin/pars/lexyy.c
storage/innodb_plugin/pars/make_bison.sh
storage/innodb_plugin/pars/make_flex.sh
storage/innodb_plugin/pars/pars0grm.c
storage/innodb_plugin/pars/pars0grm.y
storage/innodb_plugin/pars/pars0lex.l
storage/innodb_plugin/pars/pars0opt.c
storage/innodb_plugin/pars/pars0pars.c
storage/innodb_plugin/pars/pars0sym.c
storage/innodb_plugin/plug.in.disabled
storage/innodb_plugin/que/
storage/innodb_plugin/que/que0que.c
storage/innodb_plugin/read/
storage/innodb_plugin/read/read0read.c
storage/innodb_plugin/rem/
storage/innodb_plugin/rem/rem0cmp.c
storage/innodb_plugin/rem/rem0rec.c
storage/innodb_plugin/revert_gen.sh
storage/innodb_plugin/row/
storage/innodb_plugin/row/row0ext.c
storage/innodb_plugin/row/row0ins.c
storage/innodb_plugin/row/row0merge.c
storage/innodb_plugin/row/row0mysql.c
storage/innodb_plugin/row/row0purge.c
storage/innodb_plugin/row/row0row.c
storage/innodb_plugin/row/row0sel.c
storage/innodb_plugin/row/row0uins.c
storage/innodb_plugin/row/row0umod.c
storage/innodb_plugin/row/row0undo.c
storage/innodb_plugin/row/row0upd.c
storage/innodb_plugin/row/row0vers.c
storage/innodb_plugin/scripts/
storage/innodb_plugin/scripts/export.sh
storage/innodb_plugin/scripts/install_innodb_plugins.sql
storage/innodb_plugin/scripts/install_innodb_plugins_win.sql
storage/innodb_plugin/setup.sh
storage/innodb_plugin/srv/
storage/innodb_plugin/srv/srv0que.c
storage/innodb_plugin/srv/srv0srv.c
storage/innodb_plugin/srv/srv0start.c
storage/innodb_plugin/sync/
storage/innodb_plugin/sync/sync0arr.c
storage/innodb_plugin/sync/sync0rw.c
storage/innodb_plugin/sync/sync0sync.c
storage/innodb_plugin/thr/
storage/innodb_plugin/thr/thr0loc.c
storage/innodb_plugin/trx/
storage/innodb_plugin/trx/trx0i_s.c
storage/innodb_plugin/trx/trx0purge.c
storage/innodb_plugin/trx/trx0rec.c
storage/innodb_plugin/trx/trx0roll.c
storage/innodb_plugin/trx/trx0rseg.c
storage/innodb_plugin/trx/trx0sys.c
storage/innodb_plugin/trx/trx0trx.c
storage/innodb_plugin/trx/trx0undo.c
storage/innodb_plugin/usr/
storage/innodb_plugin/usr/usr0sess.c
storage/innodb_plugin/ut/
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_gcc.c
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_solaris.c
storage/innodb_plugin/ut/ut0auxconf_have_gcc_atomics.c
storage/innodb_plugin/ut/ut0auxconf_have_solaris_atomics.c
storage/innodb_plugin/ut/ut0auxconf_pause.c
storage/innodb_plugin/ut/ut0auxconf_sizeof_pthread_t.c
storage/innodb_plugin/ut/ut0byte.c
storage/innodb_plugin/ut/ut0dbg.c
storage/innodb_plugin/ut/ut0list.c
storage/innodb_plugin/ut/ut0mem.c
storage/innodb_plugin/ut/ut0rnd.c
storage/innodb_plugin/ut/ut0ut.c
storage/innodb_plugin/ut/ut0vec.c
storage/innodb_plugin/ut/ut0wqueue.c
added:
include/my_valgrind.h
mysql-test/include/ctype_innodb_like.inc
mysql-test/include/have_innodb.inc
mysql-test/include/have_innodb_plugin.inc
mysql-test/include/innodb_trx_weight.inc
mysql-test/include/min_null_cond.inc
mysql-test/include/not_binlog_format_row.inc
mysql-test/include/view_alias.inc
mysql-test/r/bug39022.result
mysql-test/r/bug46261.result
mysql-test/r/log_tables_upgrade.result
mysql-test/r/no_binlog.result
mysql-test/r/partition_debug_sync.result
mysql-test/r/plugin_not_embedded.result
mysql-test/r/view_alias.result
mysql-test/std_data/binlog_savepoint.000001
mysql-test/std_data/bug46565.ARZ
mysql-test/std_data/bug46565.frm
mysql-test/std_data/bug48265.frm
mysql-test/std_data/bug48449.frm
mysql-test/std_data/bug49823.CSM
mysql-test/std_data/bug49823.CSV
mysql-test/std_data/bug49823.frm
mysql-test/suite/engines/
mysql-test/suite/engines/README
mysql-test/suite/engines/funcs/
mysql-test/suite/engines/funcs/r/
mysql-test/suite/engines/funcs/r/ai_init_alter_table.result
mysql-test/suite/engines/funcs/r/ai_init_create_table.result
mysql-test/suite/engines/funcs/r/ai_init_insert.result
mysql-test/suite/engines/funcs/r/ai_init_insert_id.result
mysql-test/suite/engines/funcs/r/ai_overflow_error.result
mysql-test/suite/engines/funcs/r/ai_reset_by_truncate.result
mysql-test/suite/engines/funcs/r/ai_sql_auto_is_null.result
mysql-test/suite/engines/funcs/r/an_calendar.result
mysql-test/suite/engines/funcs/r/an_number.result
mysql-test/suite/engines/funcs/r/an_string.result
mysql-test/suite/engines/funcs/r/comment_column.result
mysql-test/suite/engines/funcs/r/comment_column2.result
mysql-test/suite/engines/funcs/r/comment_table.result
mysql-test/suite/engines/funcs/r/crash_manycolumns_number.result
mysql-test/suite/engines/funcs/r/crash_manycolumns_string.result
mysql-test/suite/engines/funcs/r/crash_manyindexes_number.result
mysql-test/suite/engines/funcs/r/crash_manyindexes_string.result
mysql-test/suite/engines/funcs/r/crash_manytables_number.result
mysql-test/suite/engines/funcs/r/crash_manytables_string.result
mysql-test/suite/engines/funcs/r/date_function.result
mysql-test/suite/engines/funcs/r/datetime_function.result
mysql-test/suite/engines/funcs/r/db_alter_character_set.result
mysql-test/suite/engines/funcs/r/db_alter_character_set_collate.result
mysql-test/suite/engines/funcs/r/db_alter_collate_ascii.result
mysql-test/suite/engines/funcs/r/db_alter_collate_utf8.result
mysql-test/suite/engines/funcs/r/db_create_character_set.result
mysql-test/suite/engines/funcs/r/db_create_character_set_collate.result
mysql-test/suite/engines/funcs/r/db_create_drop.result
mysql-test/suite/engines/funcs/r/db_create_error.result
mysql-test/suite/engines/funcs/r/db_create_error_reserved.result
mysql-test/suite/engines/funcs/r/db_create_if_not_exists.result
mysql-test/suite/engines/funcs/r/db_drop_error.result
mysql-test/suite/engines/funcs/r/db_use_error.result
mysql-test/suite/engines/funcs/r/de_autoinc.result
mysql-test/suite/engines/funcs/r/de_calendar_range.result
mysql-test/suite/engines/funcs/r/de_ignore.result
mysql-test/suite/engines/funcs/r/de_limit.result
mysql-test/suite/engines/funcs/r/de_multi_db_table.result
mysql-test/suite/engines/funcs/r/de_multi_db_table_using.result
mysql-test/suite/engines/funcs/r/de_multi_table.result
mysql-test/suite/engines/funcs/r/de_multi_table_using.result
mysql-test/suite/engines/funcs/r/de_number_range.result
mysql-test/suite/engines/funcs/r/de_quick.result
mysql-test/suite/engines/funcs/r/de_string_range.result
mysql-test/suite/engines/funcs/r/de_truncate.result
mysql-test/suite/engines/funcs/r/de_truncate_autoinc.result
mysql-test/suite/engines/funcs/r/fu_aggregate_avg_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_count_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_max_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_max_subquery.result
mysql-test/suite/engines/funcs/r/fu_aggregate_min_number.result
mysql-test/suite/engines/funcs/r/fu_aggregate_sum_number.result
mysql-test/suite/engines/funcs/r/general_no_data.result
mysql-test/suite/engines/funcs/r/general_not_null.result
mysql-test/suite/engines/funcs/r/general_null.result
mysql-test/suite/engines/funcs/r/in_calendar_2_unique_constraints_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_calendar_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_calendar_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_calendar_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_calendar_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_calendar_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_calendar_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_enum_null.result
mysql-test/suite/engines/funcs/r/in_enum_null_boundary_error.result
mysql-test/suite/engines/funcs/r/in_enum_null_large_error.result
mysql-test/suite/engines/funcs/r/in_insert_select.result
mysql-test/suite/engines/funcs/r/in_insert_select_autoinc.result
mysql-test/suite/engines/funcs/r/in_insert_select_unique_violation.result
mysql-test/suite/engines/funcs/r/in_lob_boundary_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_calendar_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_number_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_multicolumn_string_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_number_2_unique_constraints_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_number_boundary_error.result
mysql-test/suite/engines/funcs/r/in_number_decimal_boundary_error.result
mysql-test/suite/engines/funcs/r/in_number_length.result
mysql-test/suite/engines/funcs/r/in_number_null.result
mysql-test/suite/engines/funcs/r/in_number_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_number_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_number_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_number_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_number_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_number_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_set_null.result
mysql-test/suite/engines/funcs/r/in_set_null_boundary_error.result
mysql-test/suite/engines/funcs/r/in_set_null_large.result
mysql-test/suite/engines/funcs/r/in_string_2_unique_constraints_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_string_boundary_error.result
mysql-test/suite/engines/funcs/r/in_string_not_null.result
mysql-test/suite/engines/funcs/r/in_string_null.result
mysql-test/suite/engines/funcs/r/in_string_pk_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_string_pk_constraint_error.result
mysql-test/suite/engines/funcs/r/in_string_pk_constraint_ignore.result
mysql-test/suite/engines/funcs/r/in_string_set_enum_fail.result
mysql-test/suite/engines/funcs/r/in_string_unique_constraint_duplicate_update.result
mysql-test/suite/engines/funcs/r/in_string_unique_constraint_error.result
mysql-test/suite/engines/funcs/r/in_string_unique_constraint_ignore.result
mysql-test/suite/engines/funcs/r/ix_drop.result
mysql-test/suite/engines/funcs/r/ix_drop_error.result
mysql-test/suite/engines/funcs/r/ix_index_decimals.result
mysql-test/suite/engines/funcs/r/ix_index_lob.result
mysql-test/suite/engines/funcs/r/ix_index_non_string.result
mysql-test/suite/engines/funcs/r/ix_index_string.result
mysql-test/suite/engines/funcs/r/ix_index_string_length.result
mysql-test/suite/engines/funcs/r/ix_unique_decimals.result
mysql-test/suite/engines/funcs/r/ix_unique_lob.result
mysql-test/suite/engines/funcs/r/ix_unique_non_string.result
mysql-test/suite/engines/funcs/r/ix_unique_string.result
mysql-test/suite/engines/funcs/r/ix_unique_string_length.result
mysql-test/suite/engines/funcs/r/ix_using_order.result
mysql-test/suite/engines/funcs/r/jp_comment_column.result
mysql-test/suite/engines/funcs/r/jp_comment_older_compatibility1.result
mysql-test/suite/engines/funcs/r/jp_comment_table.result
mysql-test/suite/engines/funcs/r/ld_all_number_string_calendar_types.result
mysql-test/suite/engines/funcs/r/ld_bit.result
mysql-test/suite/engines/funcs/r/ld_enum_set.result
mysql-test/suite/engines/funcs/r/ld_less_columns.result
mysql-test/suite/engines/funcs/r/ld_more_columns_truncated.result
mysql-test/suite/engines/funcs/r/ld_null.result
mysql-test/suite/engines/funcs/r/ld_quote.result
mysql-test/suite/engines/funcs/r/ld_simple.result
mysql-test/suite/engines/funcs/r/ld_starting.result
mysql-test/suite/engines/funcs/r/ld_unique_error1.result
mysql-test/suite/engines/funcs/r/ld_unique_error1_local.result
mysql-test/suite/engines/funcs/r/ld_unique_error2.result
mysql-test/suite/engines/funcs/r/ld_unique_error2_local.result
mysql-test/suite/engines/funcs/r/ld_unique_error3.result
mysql-test/suite/engines/funcs/r/ld_unique_error3_local.result
mysql-test/suite/engines/funcs/r/ps_number_length.result
mysql-test/suite/engines/funcs/r/ps_number_null.result
mysql-test/suite/engines/funcs/r/ps_string_not_null.result
mysql-test/suite/engines/funcs/r/ps_string_null.result
mysql-test/suite/engines/funcs/r/re_number_range.result
mysql-test/suite/engines/funcs/r/re_number_range_set.result
mysql-test/suite/engines/funcs/r/re_number_select.result
mysql-test/suite/engines/funcs/r/re_string_range.result
mysql-test/suite/engines/funcs/r/re_string_range_set.result
mysql-test/suite/engines/funcs/r/rpl000010.result
mysql-test/suite/engines/funcs/r/rpl000011.result
mysql-test/suite/engines/funcs/r/rpl000013.result
mysql-test/suite/engines/funcs/r/rpl000017.result
mysql-test/suite/engines/funcs/r/rpl_000015.result
mysql-test/suite/engines/funcs/r/rpl_LD_INFILE.result
mysql-test/suite/engines/funcs/r/rpl_REDIRECT.result
mysql-test/suite/engines/funcs/r/rpl_alter.result
mysql-test/suite/engines/funcs/r/rpl_alter_db.result
mysql-test/suite/engines/funcs/r/rpl_bit.result
mysql-test/suite/engines/funcs/r/rpl_bit_npk.result
mysql-test/suite/engines/funcs/r/rpl_change_master.result
mysql-test/suite/engines/funcs/r/rpl_create_database.result
mysql-test/suite/engines/funcs/r/rpl_do_grant.result
mysql-test/suite/engines/funcs/r/rpl_drop.result
mysql-test/suite/engines/funcs/r/rpl_drop_db.result
mysql-test/suite/engines/funcs/r/rpl_dual_pos_advance.result
mysql-test/suite/engines/funcs/r/rpl_empty_master_crash.result
mysql-test/suite/engines/funcs/r/rpl_err_ignoredtable.result
mysql-test/suite/engines/funcs/r/rpl_flushlog_loop.result
mysql-test/suite/engines/funcs/r/rpl_free_items.result
mysql-test/suite/engines/funcs/r/rpl_get_lock.result
mysql-test/suite/engines/funcs/r/rpl_ignore_grant.result
mysql-test/suite/engines/funcs/r/rpl_ignore_revoke.result
mysql-test/suite/engines/funcs/r/rpl_ignore_table_update.result
mysql-test/suite/engines/funcs/r/rpl_init_slave.result
mysql-test/suite/engines/funcs/r/rpl_insert.result
mysql-test/suite/engines/funcs/r/rpl_insert_select.result
mysql-test/suite/engines/funcs/r/rpl_loaddata2.result
mysql-test/suite/engines/funcs/r/rpl_loaddata_m.result
mysql-test/suite/engines/funcs/r/rpl_loaddata_s.result
mysql-test/suite/engines/funcs/r/rpl_loaddatalocal.result
mysql-test/suite/engines/funcs/r/rpl_loadfile.result
mysql-test/suite/engines/funcs/r/rpl_log_pos.result
mysql-test/suite/engines/funcs/r/rpl_many_optimize.result
mysql-test/suite/engines/funcs/r/rpl_master_pos_wait.result
mysql-test/suite/engines/funcs/r/rpl_misc_functions.result
mysql-test/suite/engines/funcs/r/rpl_multi_delete.result
mysql-test/suite/engines/funcs/r/rpl_multi_delete2.result
mysql-test/suite/engines/funcs/r/rpl_multi_update4.result
mysql-test/suite/engines/funcs/r/rpl_ps.result
mysql-test/suite/engines/funcs/r/rpl_rbr_to_sbr.result
mysql-test/suite/engines/funcs/r/rpl_relayspace.result
mysql-test/suite/engines/funcs/r/rpl_replicate_ignore_db.result
mysql-test/suite/engines/funcs/r/rpl_row_NOW.result
mysql-test/suite/engines/funcs/r/rpl_row_USER.result
mysql-test/suite/engines/funcs/r/rpl_row_drop.result
mysql-test/suite/engines/funcs/r/rpl_row_func001.result
mysql-test/suite/engines/funcs/r/rpl_row_inexist_tbl.result
mysql-test/suite/engines/funcs/r/rpl_row_max_relay_size.result
mysql-test/suite/engines/funcs/r/rpl_row_reset_slave.result
mysql-test/suite/engines/funcs/r/rpl_row_sp001.result
mysql-test/suite/engines/funcs/r/rpl_row_sp005.result
mysql-test/suite/engines/funcs/r/rpl_row_sp008.result
mysql-test/suite/engines/funcs/r/rpl_row_sp009.result
mysql-test/suite/engines/funcs/r/rpl_row_sp010.result
mysql-test/suite/engines/funcs/r/rpl_row_sp011.result
mysql-test/suite/engines/funcs/r/rpl_row_sp012.result
mysql-test/suite/engines/funcs/r/rpl_row_stop_middle.result
mysql-test/suite/engines/funcs/r/rpl_row_trig001.result
mysql-test/suite/engines/funcs/r/rpl_row_trig002.result
mysql-test/suite/engines/funcs/r/rpl_row_trig003.result
mysql-test/suite/engines/funcs/r/rpl_row_until.result
mysql-test/suite/engines/funcs/r/rpl_row_view01.result
mysql-test/suite/engines/funcs/r/rpl_server_id1.result
mysql-test/suite/engines/funcs/r/rpl_server_id2.result
mysql-test/suite/engines/funcs/r/rpl_session_var.result
mysql-test/suite/engines/funcs/r/rpl_sf.result
mysql-test/suite/engines/funcs/r/rpl_skip_error.result
mysql-test/suite/engines/funcs/r/rpl_slave_status.result
mysql-test/suite/engines/funcs/r/rpl_sp.result
mysql-test/suite/engines/funcs/r/rpl_sp004.result
mysql-test/suite/engines/funcs/r/rpl_sp_effects.result
mysql-test/suite/engines/funcs/r/rpl_start_stop_slave.result
mysql-test/suite/engines/funcs/r/rpl_stm_max_relay_size.result
mysql-test/suite/engines/funcs/r/rpl_stm_mystery22.result
mysql-test/suite/engines/funcs/r/rpl_stm_no_op.result
mysql-test/suite/engines/funcs/r/rpl_stm_reset_slave.result
mysql-test/suite/engines/funcs/r/rpl_switch_stm_row_mixed.result
mysql-test/suite/engines/funcs/r/rpl_temp_table.result
mysql-test/suite/engines/funcs/r/rpl_temporary.result
mysql-test/suite/engines/funcs/r/rpl_trigger.result
mysql-test/suite/engines/funcs/r/rpl_trunc_temp.result
mysql-test/suite/engines/funcs/r/rpl_user_variables.result
mysql-test/suite/engines/funcs/r/rpl_variables.result
mysql-test/suite/engines/funcs/r/rpl_view.result
mysql-test/suite/engines/funcs/r/se_join_cross.result
mysql-test/suite/engines/funcs/r/se_join_default.result
mysql-test/suite/engines/funcs/r/se_join_inner.result
mysql-test/suite/engines/funcs/r/se_join_left.result
mysql-test/suite/engines/funcs/r/se_join_left_outer.result
mysql-test/suite/engines/funcs/r/se_join_natural_left.result
mysql-test/suite/engines/funcs/r/se_join_natural_left_outer.result
mysql-test/suite/engines/funcs/r/se_join_natural_right.result
mysql-test/suite/engines/funcs/r/se_join_natural_right_outer.result
mysql-test/suite/engines/funcs/r/se_join_right.result
mysql-test/suite/engines/funcs/r/se_join_right_outer.result
mysql-test/suite/engines/funcs/r/se_join_straight.result
mysql-test/suite/engines/funcs/r/se_rowid.result
mysql-test/suite/engines/funcs/r/se_string_distinct.result
mysql-test/suite/engines/funcs/r/se_string_from.result
mysql-test/suite/engines/funcs/r/se_string_groupby.result
mysql-test/suite/engines/funcs/r/se_string_having.result
mysql-test/suite/engines/funcs/r/se_string_limit.result
mysql-test/suite/engines/funcs/r/se_string_orderby.result
mysql-test/suite/engines/funcs/r/se_string_union.result
mysql-test/suite/engines/funcs/r/se_string_where.result
mysql-test/suite/engines/funcs/r/se_string_where_and.result
mysql-test/suite/engines/funcs/r/se_string_where_or.result
mysql-test/suite/engines/funcs/r/sf_alter.result
mysql-test/suite/engines/funcs/r/sf_cursor.result
mysql-test/suite/engines/funcs/r/sf_simple1.result
mysql-test/suite/engines/funcs/r/sp_alter.result
mysql-test/suite/engines/funcs/r/sp_cursor.result
mysql-test/suite/engines/funcs/r/sp_simple1.result
mysql-test/suite/engines/funcs/r/sq_all.result
mysql-test/suite/engines/funcs/r/sq_any.result
mysql-test/suite/engines/funcs/r/sq_corr.result
mysql-test/suite/engines/funcs/r/sq_error.result
mysql-test/suite/engines/funcs/r/sq_exists.result
mysql-test/suite/engines/funcs/r/sq_from.result
mysql-test/suite/engines/funcs/r/sq_in.result
mysql-test/suite/engines/funcs/r/sq_row.result
mysql-test/suite/engines/funcs/r/sq_scalar.result
mysql-test/suite/engines/funcs/r/sq_some.result
mysql-test/suite/engines/funcs/r/ta_2part_column_to_pk.result
mysql-test/suite/engines/funcs/r/ta_2part_diff_string_to_pk.result
mysql-test/suite/engines/funcs/r/ta_2part_diff_to_pk.result
mysql-test/suite/engines/funcs/r/ta_2part_string_to_pk.result
mysql-test/suite/engines/funcs/r/ta_3part_column_to_pk.result
mysql-test/suite/engines/funcs/r/ta_3part_string_to_pk.result
mysql-test/suite/engines/funcs/r/ta_add_column.result
mysql-test/suite/engines/funcs/r/ta_add_column2.result
mysql-test/suite/engines/funcs/r/ta_add_column_first.result
mysql-test/suite/engines/funcs/r/ta_add_column_first2.result
mysql-test/suite/engines/funcs/r/ta_add_column_middle.result
mysql-test/suite/engines/funcs/r/ta_add_column_middle2.result
mysql-test/suite/engines/funcs/r/ta_add_string.result
mysql-test/suite/engines/funcs/r/ta_add_string2.result
mysql-test/suite/engines/funcs/r/ta_add_string_first.result
mysql-test/suite/engines/funcs/r/ta_add_string_first2.result
mysql-test/suite/engines/funcs/r/ta_add_string_middle.result
mysql-test/suite/engines/funcs/r/ta_add_string_middle2.result
mysql-test/suite/engines/funcs/r/ta_add_string_unique_index.result
mysql-test/suite/engines/funcs/r/ta_add_unique_index.result
mysql-test/suite/engines/funcs/r/ta_column_from_unsigned.result
mysql-test/suite/engines/funcs/r/ta_column_from_zerofill.result
mysql-test/suite/engines/funcs/r/ta_column_to_index.result
mysql-test/suite/engines/funcs/r/ta_column_to_not_null.result
mysql-test/suite/engines/funcs/r/ta_column_to_null.result
mysql-test/suite/engines/funcs/r/ta_column_to_pk.result
mysql-test/suite/engines/funcs/r/ta_column_to_unsigned.result
mysql-test/suite/engines/funcs/r/ta_column_to_zerofill.result
mysql-test/suite/engines/funcs/r/ta_drop_column.result
mysql-test/suite/engines/funcs/r/ta_drop_index.result
mysql-test/suite/engines/funcs/r/ta_drop_pk_autoincrement.result
mysql-test/suite/engines/funcs/r/ta_drop_pk_number.result
mysql-test/suite/engines/funcs/r/ta_drop_pk_string.result
mysql-test/suite/engines/funcs/r/ta_drop_string_index.result
mysql-test/suite/engines/funcs/r/ta_orderby.result
mysql-test/suite/engines/funcs/r/ta_rename.result
mysql-test/suite/engines/funcs/r/ta_set_drop_default.result
mysql-test/suite/engines/funcs/r/ta_string_drop_column.result
mysql-test/suite/engines/funcs/r/ta_string_to_index.result
mysql-test/suite/engines/funcs/r/ta_string_to_not_null.result
mysql-test/suite/engines/funcs/r/ta_string_to_null.result
mysql-test/suite/engines/funcs/r/ta_string_to_pk.result
mysql-test/suite/engines/funcs/r/tc_column_autoincrement.result
mysql-test/suite/engines/funcs/r/tc_column_comment.result
mysql-test/suite/engines/funcs/r/tc_column_comment_string.result
mysql-test/suite/engines/funcs/r/tc_column_default_decimal.result
mysql-test/suite/engines/funcs/r/tc_column_default_number.result
mysql-test/suite/engines/funcs/r/tc_column_default_string.result
mysql-test/suite/engines/funcs/r/tc_column_enum.result
mysql-test/suite/engines/funcs/r/tc_column_enum_long.result
mysql-test/suite/engines/funcs/r/tc_column_key.result
mysql-test/suite/engines/funcs/r/tc_column_key_length.result
mysql-test/suite/engines/funcs/r/tc_column_length.result
mysql-test/suite/engines/funcs/r/tc_column_length_decimals.result
mysql-test/suite/engines/funcs/r/tc_column_length_zero.result
mysql-test/suite/engines/funcs/r/tc_column_not_null.result
mysql-test/suite/engines/funcs/r/tc_column_null.result
mysql-test/suite/engines/funcs/r/tc_column_primary_key_number.result
mysql-test/suite/engines/funcs/r/tc_column_primary_key_string.result
mysql-test/suite/engines/funcs/r/tc_column_serial.result
mysql-test/suite/engines/funcs/r/tc_column_set.result
mysql-test/suite/engines/funcs/r/tc_column_set_long.result
mysql-test/suite/engines/funcs/r/tc_column_unique_key.result
mysql-test/suite/engines/funcs/r/tc_column_unique_key_string.result
mysql-test/suite/engines/funcs/r/tc_column_unsigned.result
mysql-test/suite/engines/funcs/r/tc_column_zerofill.result
mysql-test/suite/engines/funcs/r/tc_drop_table.result
mysql-test/suite/engines/funcs/r/tc_multicolumn_different.result
mysql-test/suite/engines/funcs/r/tc_multicolumn_same.result
mysql-test/suite/engines/funcs/r/tc_multicolumn_same_string.result
mysql-test/suite/engines/funcs/r/tc_partition_analyze.result
mysql-test/suite/engines/funcs/r/tc_partition_change_from_range_to_hash_key.result
mysql-test/suite/engines/funcs/r/tc_partition_check.result
mysql-test/suite/engines/funcs/r/tc_partition_hash.result
mysql-test/suite/engines/funcs/r/tc_partition_hash_date_function.result
mysql-test/suite/engines/funcs/r/tc_partition_key.result
mysql-test/suite/engines/funcs/r/tc_partition_linear_key.result
mysql-test/suite/engines/funcs/r/tc_partition_list_directory.result
mysql-test/suite/engines/funcs/r/tc_partition_list_error.result
mysql-test/suite/engines/funcs/r/tc_partition_optimize.result
mysql-test/suite/engines/funcs/r/tc_partition_rebuild.result
mysql-test/suite/engines/funcs/r/tc_partition_remove.result
mysql-test/suite/engines/funcs/r/tc_partition_reorg_divide.result
mysql-test/suite/engines/funcs/r/tc_partition_reorg_hash_key.result
mysql-test/suite/engines/funcs/r/tc_partition_reorg_merge.result
mysql-test/suite/engines/funcs/r/tc_partition_repair.result
mysql-test/suite/engines/funcs/r/tc_partition_sub1.result
mysql-test/suite/engines/funcs/r/tc_partition_sub2.result
mysql-test/suite/engines/funcs/r/tc_partition_value.result
mysql-test/suite/engines/funcs/r/tc_partition_value_error.result
mysql-test/suite/engines/funcs/r/tc_partition_value_specific.result
mysql-test/suite/engines/funcs/r/tc_rename.result
mysql-test/suite/engines/funcs/r/tc_rename_across_database.result
mysql-test/suite/engines/funcs/r/tc_rename_error.result
mysql-test/suite/engines/funcs/r/tc_structure_comment.result
mysql-test/suite/engines/funcs/r/tc_structure_create_like.result
mysql-test/suite/engines/funcs/r/tc_structure_create_like_string.result
mysql-test/suite/engines/funcs/r/tc_structure_create_select.result
mysql-test/suite/engines/funcs/r/tc_structure_create_select_string.result
mysql-test/suite/engines/funcs/r/tc_structure_string_comment.result
mysql-test/suite/engines/funcs/r/tc_temporary_column.result
mysql-test/suite/engines/funcs/r/tc_temporary_column_length.result
mysql-test/suite/engines/funcs/r/time_function.result
mysql-test/suite/engines/funcs/r/tr_all_type_triggers.result
mysql-test/suite/engines/funcs/r/tr_delete.result
mysql-test/suite/engines/funcs/r/tr_delete_new_error.result
mysql-test/suite/engines/funcs/r/tr_insert.result
mysql-test/suite/engines/funcs/r/tr_insert_after_error.result
mysql-test/suite/engines/funcs/r/tr_insert_old_error.result
mysql-test/suite/engines/funcs/r/tr_update.result
mysql-test/suite/engines/funcs/r/tr_update_after_error.result
mysql-test/suite/engines/funcs/r/up_calendar_range.result
mysql-test/suite/engines/funcs/r/up_ignore.result
mysql-test/suite/engines/funcs/r/up_limit.result
mysql-test/suite/engines/funcs/r/up_multi_db_table.result
mysql-test/suite/engines/funcs/r/up_multi_table.result
mysql-test/suite/engines/funcs/r/up_nullcheck.result
mysql-test/suite/engines/funcs/r/up_number_range.result
mysql-test/suite/engines/funcs/r/up_string_range.result
mysql-test/suite/engines/funcs/t/
mysql-test/suite/engines/funcs/t/ai_init_alter_table.test
mysql-test/suite/engines/funcs/t/ai_init_create_table.test
mysql-test/suite/engines/funcs/t/ai_init_insert.test
mysql-test/suite/engines/funcs/t/ai_init_insert_id.test
mysql-test/suite/engines/funcs/t/ai_overflow_error.test
mysql-test/suite/engines/funcs/t/ai_reset_by_truncate.test
mysql-test/suite/engines/funcs/t/ai_sql_auto_is_null.test
mysql-test/suite/engines/funcs/t/an_calendar.test
mysql-test/suite/engines/funcs/t/an_number.test
mysql-test/suite/engines/funcs/t/an_string.test
mysql-test/suite/engines/funcs/t/comment_column.test
mysql-test/suite/engines/funcs/t/comment_column2.test
mysql-test/suite/engines/funcs/t/comment_table.test
mysql-test/suite/engines/funcs/t/crash_manycolumns_number.test
mysql-test/suite/engines/funcs/t/crash_manycolumns_string.test
mysql-test/suite/engines/funcs/t/crash_manyindexes_number.test
mysql-test/suite/engines/funcs/t/crash_manyindexes_string.test
mysql-test/suite/engines/funcs/t/crash_manytables_number.test
mysql-test/suite/engines/funcs/t/crash_manytables_string.test
mysql-test/suite/engines/funcs/t/data1.inc
mysql-test/suite/engines/funcs/t/data2.inc
mysql-test/suite/engines/funcs/t/date_function.test
mysql-test/suite/engines/funcs/t/datetime_function.test
mysql-test/suite/engines/funcs/t/db_alter_character_set.test
mysql-test/suite/engines/funcs/t/db_alter_character_set_collate.test
mysql-test/suite/engines/funcs/t/db_alter_collate_ascii.test
mysql-test/suite/engines/funcs/t/db_alter_collate_utf8.test
mysql-test/suite/engines/funcs/t/db_create_character_set.test
mysql-test/suite/engines/funcs/t/db_create_character_set_collate.test
mysql-test/suite/engines/funcs/t/db_create_drop.test
mysql-test/suite/engines/funcs/t/db_create_error.test
mysql-test/suite/engines/funcs/t/db_create_error_reserved.test
mysql-test/suite/engines/funcs/t/db_create_if_not_exists.test
mysql-test/suite/engines/funcs/t/db_drop_error.test
mysql-test/suite/engines/funcs/t/db_use_error.test
mysql-test/suite/engines/funcs/t/de_autoinc.test
mysql-test/suite/engines/funcs/t/de_calendar_range.test
mysql-test/suite/engines/funcs/t/de_ignore.test
mysql-test/suite/engines/funcs/t/de_limit.test
mysql-test/suite/engines/funcs/t/de_multi_db_table.test
mysql-test/suite/engines/funcs/t/de_multi_db_table_using.test
mysql-test/suite/engines/funcs/t/de_multi_table.test
mysql-test/suite/engines/funcs/t/de_multi_table_using.test
mysql-test/suite/engines/funcs/t/de_number_range.test
mysql-test/suite/engines/funcs/t/de_quick.test
mysql-test/suite/engines/funcs/t/de_string_range.test
mysql-test/suite/engines/funcs/t/de_truncate.test
mysql-test/suite/engines/funcs/t/de_truncate_autoinc.test
mysql-test/suite/engines/funcs/t/disabled.def
mysql-test/suite/engines/funcs/t/fu_aggregate_avg_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_count_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_max_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_max_subquery.test
mysql-test/suite/engines/funcs/t/fu_aggregate_min_number.test
mysql-test/suite/engines/funcs/t/fu_aggregate_sum_number.test
mysql-test/suite/engines/funcs/t/general_no_data.test
mysql-test/suite/engines/funcs/t/general_not_null.test
mysql-test/suite/engines/funcs/t/general_null.test
mysql-test/suite/engines/funcs/t/in_calendar_2_unique_constraints_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_calendar_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_calendar_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_calendar_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_calendar_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_calendar_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_calendar_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_enum_null.test
mysql-test/suite/engines/funcs/t/in_enum_null_boundary_error.test
mysql-test/suite/engines/funcs/t/in_enum_null_large_error.test
mysql-test/suite/engines/funcs/t/in_insert_select.test
mysql-test/suite/engines/funcs/t/in_insert_select_autoinc.test
mysql-test/suite/engines/funcs/t/in_insert_select_unique_violation.test
mysql-test/suite/engines/funcs/t/in_lob_boundary_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_calendar_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_number_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_multicolumn_string_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_number_2_unique_constraints_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_number_boundary_error.test
mysql-test/suite/engines/funcs/t/in_number_decimal_boundary_error.test
mysql-test/suite/engines/funcs/t/in_number_length.test
mysql-test/suite/engines/funcs/t/in_number_null.test
mysql-test/suite/engines/funcs/t/in_number_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_number_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_number_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_number_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_number_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_number_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_set_null.test
mysql-test/suite/engines/funcs/t/in_set_null_boundary_error.test
mysql-test/suite/engines/funcs/t/in_set_null_large.test
mysql-test/suite/engines/funcs/t/in_string_2_unique_constraints_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_string_boundary_error.test
mysql-test/suite/engines/funcs/t/in_string_not_null.test
mysql-test/suite/engines/funcs/t/in_string_null.test
mysql-test/suite/engines/funcs/t/in_string_pk_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_string_pk_constraint_error.test
mysql-test/suite/engines/funcs/t/in_string_pk_constraint_ignore.test
mysql-test/suite/engines/funcs/t/in_string_set_enum_fail.test
mysql-test/suite/engines/funcs/t/in_string_unique_constraint_duplicate_update.test
mysql-test/suite/engines/funcs/t/in_string_unique_constraint_error.test
mysql-test/suite/engines/funcs/t/in_string_unique_constraint_ignore.test
mysql-test/suite/engines/funcs/t/ix_drop.test
mysql-test/suite/engines/funcs/t/ix_drop_error.test
mysql-test/suite/engines/funcs/t/ix_index_decimals.test
mysql-test/suite/engines/funcs/t/ix_index_lob.test
mysql-test/suite/engines/funcs/t/ix_index_non_string.test
mysql-test/suite/engines/funcs/t/ix_index_string.test
mysql-test/suite/engines/funcs/t/ix_index_string_length.test
mysql-test/suite/engines/funcs/t/ix_unique_decimals.test
mysql-test/suite/engines/funcs/t/ix_unique_lob.test
mysql-test/suite/engines/funcs/t/ix_unique_non_string.test
mysql-test/suite/engines/funcs/t/ix_unique_string.test
mysql-test/suite/engines/funcs/t/ix_unique_string_length.test
mysql-test/suite/engines/funcs/t/ix_using_order.test
mysql-test/suite/engines/funcs/t/jp_comment_column.test
mysql-test/suite/engines/funcs/t/jp_comment_older_compatibility1.test
mysql-test/suite/engines/funcs/t/jp_comment_table.test
mysql-test/suite/engines/funcs/t/ld_all_number_string_calendar_types.test
mysql-test/suite/engines/funcs/t/ld_bit.test
mysql-test/suite/engines/funcs/t/ld_enum_set.test
mysql-test/suite/engines/funcs/t/ld_less_columns.test
mysql-test/suite/engines/funcs/t/ld_more_columns_truncated.test
mysql-test/suite/engines/funcs/t/ld_null.test
mysql-test/suite/engines/funcs/t/ld_quote.test
mysql-test/suite/engines/funcs/t/ld_simple.test
mysql-test/suite/engines/funcs/t/ld_starting.test
mysql-test/suite/engines/funcs/t/ld_unique_error1.test
mysql-test/suite/engines/funcs/t/ld_unique_error1_local.test
mysql-test/suite/engines/funcs/t/ld_unique_error2.test
mysql-test/suite/engines/funcs/t/ld_unique_error2_local.test
mysql-test/suite/engines/funcs/t/ld_unique_error3.test
mysql-test/suite/engines/funcs/t/ld_unique_error3_local.test
mysql-test/suite/engines/funcs/t/load_bit.inc
mysql-test/suite/engines/funcs/t/load_enum_set.inc
mysql-test/suite/engines/funcs/t/load_less_columns.inc
mysql-test/suite/engines/funcs/t/load_more_columns.inc
mysql-test/suite/engines/funcs/t/load_null.inc
mysql-test/suite/engines/funcs/t/load_null2.inc
mysql-test/suite/engines/funcs/t/load_quote.inc
mysql-test/suite/engines/funcs/t/load_simple.inc
mysql-test/suite/engines/funcs/t/load_starting.inc
mysql-test/suite/engines/funcs/t/load_unique_error1.inc
mysql-test/suite/engines/funcs/t/load_unique_error2.inc
mysql-test/suite/engines/funcs/t/load_unique_error3.inc
mysql-test/suite/engines/funcs/t/ps_number_length.test
mysql-test/suite/engines/funcs/t/ps_number_null.test
mysql-test/suite/engines/funcs/t/ps_string_not_null.test
mysql-test/suite/engines/funcs/t/ps_string_null.test
mysql-test/suite/engines/funcs/t/re_number_range.test
mysql-test/suite/engines/funcs/t/re_number_range_set.test
mysql-test/suite/engines/funcs/t/re_number_select.test
mysql-test/suite/engines/funcs/t/re_string_range.test
mysql-test/suite/engines/funcs/t/re_string_range_set.test
mysql-test/suite/engines/funcs/t/rpl000010-slave.opt
mysql-test/suite/engines/funcs/t/rpl000010.test
mysql-test/suite/engines/funcs/t/rpl000011.test
mysql-test/suite/engines/funcs/t/rpl000013.test
mysql-test/suite/engines/funcs/t/rpl000017-slave.opt
mysql-test/suite/engines/funcs/t/rpl000017.test
mysql-test/suite/engines/funcs/t/rpl_000015.test
mysql-test/suite/engines/funcs/t/rpl_LD_INFILE.test
mysql-test/suite/engines/funcs/t/rpl_REDIRECT.test
mysql-test/suite/engines/funcs/t/rpl_alter.test
mysql-test/suite/engines/funcs/t/rpl_alter_db.test
mysql-test/suite/engines/funcs/t/rpl_bit.test
mysql-test/suite/engines/funcs/t/rpl_bit_npk.test
mysql-test/suite/engines/funcs/t/rpl_change_master.test
mysql-test/suite/engines/funcs/t/rpl_create_database-master.opt
mysql-test/suite/engines/funcs/t/rpl_create_database-slave.opt
mysql-test/suite/engines/funcs/t/rpl_create_database.test
mysql-test/suite/engines/funcs/t/rpl_do_grant.test
mysql-test/suite/engines/funcs/t/rpl_drop.test
mysql-test/suite/engines/funcs/t/rpl_drop_db.test
mysql-test/suite/engines/funcs/t/rpl_dual_pos_advance-master.opt
mysql-test/suite/engines/funcs/t/rpl_dual_pos_advance.test
mysql-test/suite/engines/funcs/t/rpl_empty_master_crash-master.opt
mysql-test/suite/engines/funcs/t/rpl_empty_master_crash.test
mysql-test/suite/engines/funcs/t/rpl_err_ignoredtable-slave.opt
mysql-test/suite/engines/funcs/t/rpl_err_ignoredtable.test
mysql-test/suite/engines/funcs/t/rpl_flushlog_loop.test
mysql-test/suite/engines/funcs/t/rpl_free_items-slave.opt
mysql-test/suite/engines/funcs/t/rpl_free_items.test
mysql-test/suite/engines/funcs/t/rpl_get_lock.test
mysql-test/suite/engines/funcs/t/rpl_ignore_grant-slave.opt
mysql-test/suite/engines/funcs/t/rpl_ignore_grant.test
mysql-test/suite/engines/funcs/t/rpl_ignore_revoke-slave.opt
mysql-test/suite/engines/funcs/t/rpl_ignore_revoke.test
mysql-test/suite/engines/funcs/t/rpl_ignore_table_update-slave.opt
mysql-test/suite/engines/funcs/t/rpl_ignore_table_update.test
mysql-test/suite/engines/funcs/t/rpl_init_slave-slave.opt
mysql-test/suite/engines/funcs/t/rpl_init_slave.test
mysql-test/suite/engines/funcs/t/rpl_insert.test
mysql-test/suite/engines/funcs/t/rpl_insert_select.test
mysql-test/suite/engines/funcs/t/rpl_loaddata2.test
mysql-test/suite/engines/funcs/t/rpl_loaddata_m-master.opt
mysql-test/suite/engines/funcs/t/rpl_loaddata_m.test
mysql-test/suite/engines/funcs/t/rpl_loaddata_s-slave.opt
mysql-test/suite/engines/funcs/t/rpl_loaddata_s.test
mysql-test/suite/engines/funcs/t/rpl_loaddatalocal.test
mysql-test/suite/engines/funcs/t/rpl_loadfile.test
mysql-test/suite/engines/funcs/t/rpl_log_pos.test
mysql-test/suite/engines/funcs/t/rpl_many_optimize.test
mysql-test/suite/engines/funcs/t/rpl_master_pos_wait.test
mysql-test/suite/engines/funcs/t/rpl_misc_functions.test
mysql-test/suite/engines/funcs/t/rpl_multi_delete-slave.opt
mysql-test/suite/engines/funcs/t/rpl_multi_delete.test
mysql-test/suite/engines/funcs/t/rpl_multi_delete2-slave.opt
mysql-test/suite/engines/funcs/t/rpl_multi_delete2.test
mysql-test/suite/engines/funcs/t/rpl_multi_update4-slave.opt
mysql-test/suite/engines/funcs/t/rpl_multi_update4.test
mysql-test/suite/engines/funcs/t/rpl_ps.test
mysql-test/suite/engines/funcs/t/rpl_rbr_to_sbr.test
mysql-test/suite/engines/funcs/t/rpl_relayspace-slave.opt
mysql-test/suite/engines/funcs/t/rpl_relayspace.test
mysql-test/suite/engines/funcs/t/rpl_replicate_ignore_db-slave.opt
mysql-test/suite/engines/funcs/t/rpl_replicate_ignore_db.test
mysql-test/suite/engines/funcs/t/rpl_row_NOW.test
mysql-test/suite/engines/funcs/t/rpl_row_USER.test
mysql-test/suite/engines/funcs/t/rpl_row_drop.test
mysql-test/suite/engines/funcs/t/rpl_row_func001.test
mysql-test/suite/engines/funcs/t/rpl_row_inexist_tbl-slave.opt
mysql-test/suite/engines/funcs/t/rpl_row_inexist_tbl.test
mysql-test/suite/engines/funcs/t/rpl_row_max_relay_size.test
mysql-test/suite/engines/funcs/t/rpl_row_reset_slave.test
mysql-test/suite/engines/funcs/t/rpl_row_sp001.test
mysql-test/suite/engines/funcs/t/rpl_row_sp005.test
mysql-test/suite/engines/funcs/t/rpl_row_sp008.test
mysql-test/suite/engines/funcs/t/rpl_row_sp009.test
mysql-test/suite/engines/funcs/t/rpl_row_sp010.test
mysql-test/suite/engines/funcs/t/rpl_row_sp011.test
mysql-test/suite/engines/funcs/t/rpl_row_sp012.test
mysql-test/suite/engines/funcs/t/rpl_row_stop_middle.test
mysql-test/suite/engines/funcs/t/rpl_row_trig001.test
mysql-test/suite/engines/funcs/t/rpl_row_trig002.test
mysql-test/suite/engines/funcs/t/rpl_row_trig003.test
mysql-test/suite/engines/funcs/t/rpl_row_until.test
mysql-test/suite/engines/funcs/t/rpl_row_view01.test
mysql-test/suite/engines/funcs/t/rpl_server_id1.test
mysql-test/suite/engines/funcs/t/rpl_server_id2-slave.opt
mysql-test/suite/engines/funcs/t/rpl_server_id2.test
mysql-test/suite/engines/funcs/t/rpl_session_var.test
mysql-test/suite/engines/funcs/t/rpl_sf.test
mysql-test/suite/engines/funcs/t/rpl_skip_error-slave.opt
mysql-test/suite/engines/funcs/t/rpl_skip_error.test
mysql-test/suite/engines/funcs/t/rpl_slave_status.test
mysql-test/suite/engines/funcs/t/rpl_sp-master.opt
mysql-test/suite/engines/funcs/t/rpl_sp-slave.opt
mysql-test/suite/engines/funcs/t/rpl_sp.test
mysql-test/suite/engines/funcs/t/rpl_sp004.test
mysql-test/suite/engines/funcs/t/rpl_sp_effects-master.opt
mysql-test/suite/engines/funcs/t/rpl_sp_effects-slave.opt
mysql-test/suite/engines/funcs/t/rpl_sp_effects.test
mysql-test/suite/engines/funcs/t/rpl_start_stop_slave.test
mysql-test/suite/engines/funcs/t/rpl_stm_max_relay_size.test
mysql-test/suite/engines/funcs/t/rpl_stm_mystery22.test
mysql-test/suite/engines/funcs/t/rpl_stm_no_op.test
mysql-test/suite/engines/funcs/t/rpl_stm_reset_slave.test
mysql-test/suite/engines/funcs/t/rpl_switch_stm_row_mixed.test
mysql-test/suite/engines/funcs/t/rpl_temp_table.test
mysql-test/suite/engines/funcs/t/rpl_temporary.test
mysql-test/suite/engines/funcs/t/rpl_trigger.test
mysql-test/suite/engines/funcs/t/rpl_trunc_temp.test
mysql-test/suite/engines/funcs/t/rpl_user_variables.test
mysql-test/suite/engines/funcs/t/rpl_variables-master.opt
mysql-test/suite/engines/funcs/t/rpl_variables.test
mysql-test/suite/engines/funcs/t/rpl_view-slave.opt
mysql-test/suite/engines/funcs/t/rpl_view.test
mysql-test/suite/engines/funcs/t/se_join_cross.test
mysql-test/suite/engines/funcs/t/se_join_default.test
mysql-test/suite/engines/funcs/t/se_join_inner.test
mysql-test/suite/engines/funcs/t/se_join_left.test
mysql-test/suite/engines/funcs/t/se_join_left_outer.test
mysql-test/suite/engines/funcs/t/se_join_natural_left.test
mysql-test/suite/engines/funcs/t/se_join_natural_left_outer.test
mysql-test/suite/engines/funcs/t/se_join_natural_right.test
mysql-test/suite/engines/funcs/t/se_join_natural_right_outer.test
mysql-test/suite/engines/funcs/t/se_join_right.test
mysql-test/suite/engines/funcs/t/se_join_right_outer.test
mysql-test/suite/engines/funcs/t/se_join_straight.test
mysql-test/suite/engines/funcs/t/se_rowid.test
mysql-test/suite/engines/funcs/t/se_string_distinct.test
mysql-test/suite/engines/funcs/t/se_string_from.test
mysql-test/suite/engines/funcs/t/se_string_groupby.test
mysql-test/suite/engines/funcs/t/se_string_having.test
mysql-test/suite/engines/funcs/t/se_string_limit.test
mysql-test/suite/engines/funcs/t/se_string_orderby.test
mysql-test/suite/engines/funcs/t/se_string_union.test
mysql-test/suite/engines/funcs/t/se_string_where.test
mysql-test/suite/engines/funcs/t/se_string_where_and.test
mysql-test/suite/engines/funcs/t/se_string_where_or.test
mysql-test/suite/engines/funcs/t/sf_alter.test
mysql-test/suite/engines/funcs/t/sf_cursor.test
mysql-test/suite/engines/funcs/t/sf_simple1.test
mysql-test/suite/engines/funcs/t/sp_alter.test
mysql-test/suite/engines/funcs/t/sp_cursor.test
mysql-test/suite/engines/funcs/t/sp_simple1.test
mysql-test/suite/engines/funcs/t/sq_all.test
mysql-test/suite/engines/funcs/t/sq_any.test
mysql-test/suite/engines/funcs/t/sq_corr.test
mysql-test/suite/engines/funcs/t/sq_error.test
mysql-test/suite/engines/funcs/t/sq_exists.test
mysql-test/suite/engines/funcs/t/sq_from.test
mysql-test/suite/engines/funcs/t/sq_in.test
mysql-test/suite/engines/funcs/t/sq_row.test
mysql-test/suite/engines/funcs/t/sq_scalar.test
mysql-test/suite/engines/funcs/t/sq_some.test
mysql-test/suite/engines/funcs/t/ta_2part_column_to_pk.test
mysql-test/suite/engines/funcs/t/ta_2part_diff_string_to_pk.test
mysql-test/suite/engines/funcs/t/ta_2part_diff_to_pk.test
mysql-test/suite/engines/funcs/t/ta_2part_string_to_pk.test
mysql-test/suite/engines/funcs/t/ta_3part_column_to_pk.test
mysql-test/suite/engines/funcs/t/ta_3part_string_to_pk.test
mysql-test/suite/engines/funcs/t/ta_add_column.test
mysql-test/suite/engines/funcs/t/ta_add_column2.test
mysql-test/suite/engines/funcs/t/ta_add_column_first.test
mysql-test/suite/engines/funcs/t/ta_add_column_first2.test
mysql-test/suite/engines/funcs/t/ta_add_column_middle.test
mysql-test/suite/engines/funcs/t/ta_add_column_middle2.test
mysql-test/suite/engines/funcs/t/ta_add_string.test
mysql-test/suite/engines/funcs/t/ta_add_string2.test
mysql-test/suite/engines/funcs/t/ta_add_string_first.test
mysql-test/suite/engines/funcs/t/ta_add_string_first2.test
mysql-test/suite/engines/funcs/t/ta_add_string_middle.test
mysql-test/suite/engines/funcs/t/ta_add_string_middle2.test
mysql-test/suite/engines/funcs/t/ta_add_string_unique_index.test
mysql-test/suite/engines/funcs/t/ta_add_unique_index.test
mysql-test/suite/engines/funcs/t/ta_column_from_unsigned.test
mysql-test/suite/engines/funcs/t/ta_column_from_zerofill.test
mysql-test/suite/engines/funcs/t/ta_column_to_index.test
mysql-test/suite/engines/funcs/t/ta_column_to_not_null.test
mysql-test/suite/engines/funcs/t/ta_column_to_null.test
mysql-test/suite/engines/funcs/t/ta_column_to_pk.test
mysql-test/suite/engines/funcs/t/ta_column_to_unsigned.test
mysql-test/suite/engines/funcs/t/ta_column_to_zerofill.test
mysql-test/suite/engines/funcs/t/ta_drop_column.test
mysql-test/suite/engines/funcs/t/ta_drop_index.test
mysql-test/suite/engines/funcs/t/ta_drop_pk_autoincrement.test
mysql-test/suite/engines/funcs/t/ta_drop_pk_number.test
mysql-test/suite/engines/funcs/t/ta_drop_pk_string.test
mysql-test/suite/engines/funcs/t/ta_drop_string_index.test
mysql-test/suite/engines/funcs/t/ta_orderby.test
mysql-test/suite/engines/funcs/t/ta_rename.test
mysql-test/suite/engines/funcs/t/ta_set_drop_default.test
mysql-test/suite/engines/funcs/t/ta_string_drop_column.test
mysql-test/suite/engines/funcs/t/ta_string_to_index.test
mysql-test/suite/engines/funcs/t/ta_string_to_not_null.test
mysql-test/suite/engines/funcs/t/ta_string_to_null.test
mysql-test/suite/engines/funcs/t/ta_string_to_pk.test
mysql-test/suite/engines/funcs/t/tc_column_autoincrement.test
mysql-test/suite/engines/funcs/t/tc_column_comment.test
mysql-test/suite/engines/funcs/t/tc_column_comment_string.test
mysql-test/suite/engines/funcs/t/tc_column_default_decimal.test
mysql-test/suite/engines/funcs/t/tc_column_default_number.test
mysql-test/suite/engines/funcs/t/tc_column_default_string.test
mysql-test/suite/engines/funcs/t/tc_column_enum.test
mysql-test/suite/engines/funcs/t/tc_column_enum_long.test
mysql-test/suite/engines/funcs/t/tc_column_key.test
mysql-test/suite/engines/funcs/t/tc_column_key_length.test
mysql-test/suite/engines/funcs/t/tc_column_length.test
mysql-test/suite/engines/funcs/t/tc_column_length_decimals.test
mysql-test/suite/engines/funcs/t/tc_column_length_zero.test
mysql-test/suite/engines/funcs/t/tc_column_not_null.test
mysql-test/suite/engines/funcs/t/tc_column_null.test
mysql-test/suite/engines/funcs/t/tc_column_primary_key_number.test
mysql-test/suite/engines/funcs/t/tc_column_primary_key_string.test
mysql-test/suite/engines/funcs/t/tc_column_serial.test
mysql-test/suite/engines/funcs/t/tc_column_set.test
mysql-test/suite/engines/funcs/t/tc_column_set_long.test
mysql-test/suite/engines/funcs/t/tc_column_unique_key.test
mysql-test/suite/engines/funcs/t/tc_column_unique_key_string.test
mysql-test/suite/engines/funcs/t/tc_column_unsigned.test
mysql-test/suite/engines/funcs/t/tc_column_zerofill.test
mysql-test/suite/engines/funcs/t/tc_drop_table.test
mysql-test/suite/engines/funcs/t/tc_multicolumn_different.test
mysql-test/suite/engines/funcs/t/tc_multicolumn_same.test
mysql-test/suite/engines/funcs/t/tc_multicolumn_same_string.test
mysql-test/suite/engines/funcs/t/tc_partition_analyze.test
mysql-test/suite/engines/funcs/t/tc_partition_change_from_range_to_hash_key.test
mysql-test/suite/engines/funcs/t/tc_partition_check.test
mysql-test/suite/engines/funcs/t/tc_partition_hash.test
mysql-test/suite/engines/funcs/t/tc_partition_hash_date_function.test
mysql-test/suite/engines/funcs/t/tc_partition_key.test
mysql-test/suite/engines/funcs/t/tc_partition_linear_key.test
mysql-test/suite/engines/funcs/t/tc_partition_list_directory.test
mysql-test/suite/engines/funcs/t/tc_partition_list_error.test
mysql-test/suite/engines/funcs/t/tc_partition_optimize.test
mysql-test/suite/engines/funcs/t/tc_partition_rebuild.test
mysql-test/suite/engines/funcs/t/tc_partition_remove.test
mysql-test/suite/engines/funcs/t/tc_partition_reorg_divide.test
mysql-test/suite/engines/funcs/t/tc_partition_reorg_hash_key.test
mysql-test/suite/engines/funcs/t/tc_partition_reorg_merge.test
mysql-test/suite/engines/funcs/t/tc_partition_repair.test
mysql-test/suite/engines/funcs/t/tc_partition_sub1.test
mysql-test/suite/engines/funcs/t/tc_partition_sub2.test
mysql-test/suite/engines/funcs/t/tc_partition_value.test
mysql-test/suite/engines/funcs/t/tc_partition_value_error.test
mysql-test/suite/engines/funcs/t/tc_partition_value_specific.test
mysql-test/suite/engines/funcs/t/tc_rename.test
mysql-test/suite/engines/funcs/t/tc_rename_across_database.test
mysql-test/suite/engines/funcs/t/tc_rename_error.test
mysql-test/suite/engines/funcs/t/tc_structure_comment.test
mysql-test/suite/engines/funcs/t/tc_structure_create_like.test
mysql-test/suite/engines/funcs/t/tc_structure_create_like_string.test
mysql-test/suite/engines/funcs/t/tc_structure_create_select.test
mysql-test/suite/engines/funcs/t/tc_structure_create_select_string.test
mysql-test/suite/engines/funcs/t/tc_structure_string_comment.test
mysql-test/suite/engines/funcs/t/tc_temporary_column.test
mysql-test/suite/engines/funcs/t/tc_temporary_column_length.test
mysql-test/suite/engines/funcs/t/time_function.test
mysql-test/suite/engines/funcs/t/tr_all_type_triggers.test
mysql-test/suite/engines/funcs/t/tr_delete.test
mysql-test/suite/engines/funcs/t/tr_delete_new_error.test
mysql-test/suite/engines/funcs/t/tr_insert.test
mysql-test/suite/engines/funcs/t/tr_insert_after_error.test
mysql-test/suite/engines/funcs/t/tr_insert_old_error.test
mysql-test/suite/engines/funcs/t/tr_update.test
mysql-test/suite/engines/funcs/t/tr_update_after_error.test
mysql-test/suite/engines/funcs/t/up_calendar_range.test
mysql-test/suite/engines/funcs/t/up_ignore.test
mysql-test/suite/engines/funcs/t/up_limit.test
mysql-test/suite/engines/funcs/t/up_multi_db_table.test
mysql-test/suite/engines/funcs/t/up_multi_table.test
mysql-test/suite/engines/funcs/t/up_nullcheck.test
mysql-test/suite/engines/funcs/t/up_number_range.test
mysql-test/suite/engines/funcs/t/up_string_range.test
mysql-test/suite/engines/funcs/t/wait_show_pattern.inc
mysql-test/suite/engines/funcs/t/wait_slave_status.inc
mysql-test/suite/engines/iuds/
mysql-test/suite/engines/iuds/r/
mysql-test/suite/engines/iuds/r/delete_decimal.result
mysql-test/suite/engines/iuds/r/delete_time.result
mysql-test/suite/engines/iuds/r/delete_year.result
mysql-test/suite/engines/iuds/r/insert_calendar.result
mysql-test/suite/engines/iuds/r/insert_decimal.result
mysql-test/suite/engines/iuds/r/insert_number.result
mysql-test/suite/engines/iuds/r/insert_time.result
mysql-test/suite/engines/iuds/r/insert_year.result
mysql-test/suite/engines/iuds/r/strings_charsets_update_delete.result
mysql-test/suite/engines/iuds/r/strings_update_delete.result
mysql-test/suite/engines/iuds/r/type_bit_iuds.result
mysql-test/suite/engines/iuds/r/update_decimal.result
mysql-test/suite/engines/iuds/r/update_delete_calendar.result
mysql-test/suite/engines/iuds/r/update_delete_number.result
mysql-test/suite/engines/iuds/r/update_time.result
mysql-test/suite/engines/iuds/r/update_year.result
mysql-test/suite/engines/iuds/t/
mysql-test/suite/engines/iuds/t/delete_decimal.test
mysql-test/suite/engines/iuds/t/delete_time.test
mysql-test/suite/engines/iuds/t/delete_year.test
mysql-test/suite/engines/iuds/t/disabled.def
mysql-test/suite/engines/iuds/t/hindi.txt
mysql-test/suite/engines/iuds/t/insert_calendar.test
mysql-test/suite/engines/iuds/t/insert_decimal.test
mysql-test/suite/engines/iuds/t/insert_number.test
mysql-test/suite/engines/iuds/t/insert_time.test
mysql-test/suite/engines/iuds/t/insert_year.test
mysql-test/suite/engines/iuds/t/sample.txt
mysql-test/suite/engines/iuds/t/strings_charsets_update_delete.test
mysql-test/suite/engines/iuds/t/strings_update_delete.test
mysql-test/suite/engines/iuds/t/type_bit_iuds.test
mysql-test/suite/engines/iuds/t/update_decimal.test
mysql-test/suite/engines/iuds/t/update_delete_calendar.test
mysql-test/suite/engines/iuds/t/update_delete_number.test
mysql-test/suite/engines/iuds/t/update_time.test
mysql-test/suite/engines/iuds/t/update_year.test
mysql-test/suite/engines/rr_trx/
mysql-test/suite/engines/rr_trx/check_consistency.sql
mysql-test/suite/engines/rr_trx/include/
mysql-test/suite/engines/rr_trx/include/check_for_error_rollback.inc
mysql-test/suite/engines/rr_trx/include/check_for_error_rollback_skip.inc
mysql-test/suite/engines/rr_trx/include/check_repeatable_read_all_columns.inc
mysql-test/suite/engines/rr_trx/include/record_query_all_columns.inc
mysql-test/suite/engines/rr_trx/include/rr_init.test
mysql-test/suite/engines/rr_trx/init_innodb.txt
mysql-test/suite/engines/rr_trx/r/
mysql-test/suite/engines/rr_trx/r/init_innodb.result
mysql-test/suite/engines/rr_trx/r/rr_c_count_not_zero.result
mysql-test/suite/engines/rr_trx/r/rr_c_stats.result
mysql-test/suite/engines/rr_trx/r/rr_i_40-44.result
mysql-test/suite/engines/rr_trx/r/rr_id_3.result
mysql-test/suite/engines/rr_trx/r/rr_id_900.result
mysql-test/suite/engines/rr_trx/r/rr_insert_select_2.result
mysql-test/suite/engines/rr_trx/r/rr_iud_rollback-multi-50.result
mysql-test/suite/engines/rr_trx/r/rr_replace_7-8.result
mysql-test/suite/engines/rr_trx/r/rr_s_select-uncommitted.result
mysql-test/suite/engines/rr_trx/r/rr_sc_select-limit-nolimit_4.result
mysql-test/suite/engines/rr_trx/r/rr_sc_select-same_2.result
mysql-test/suite/engines/rr_trx/r/rr_sc_sum_total.result
mysql-test/suite/engines/rr_trx/r/rr_u_10-19.result
mysql-test/suite/engines/rr_trx/r/rr_u_10-19_nolimit.result
mysql-test/suite/engines/rr_trx/r/rr_u_4.result
mysql-test/suite/engines/rr_trx/run.txt
mysql-test/suite/engines/rr_trx/run_stress_tx_rr.pl
mysql-test/suite/engines/rr_trx/t/
mysql-test/suite/engines/rr_trx/t/init_innodb.test
mysql-test/suite/engines/rr_trx/t/rr_c_count_not_zero.test
mysql-test/suite/engines/rr_trx/t/rr_c_stats.test
mysql-test/suite/engines/rr_trx/t/rr_i_40-44.test
mysql-test/suite/engines/rr_trx/t/rr_id_3.test
mysql-test/suite/engines/rr_trx/t/rr_id_900.test
mysql-test/suite/engines/rr_trx/t/rr_insert_select_2.test
mysql-test/suite/engines/rr_trx/t/rr_iud_rollback-multi-50.test
mysql-test/suite/engines/rr_trx/t/rr_replace_7-8.test
mysql-test/suite/engines/rr_trx/t/rr_s_select-uncommitted.test
mysql-test/suite/engines/rr_trx/t/rr_sc_select-limit-nolimit_4.test
mysql-test/suite/engines/rr_trx/t/rr_sc_select-same_2.test
mysql-test/suite/engines/rr_trx/t/rr_sc_sum_total.test
mysql-test/suite/engines/rr_trx/t/rr_u_10-19.test
mysql-test/suite/engines/rr_trx/t/rr_u_10-19_nolimit.test
mysql-test/suite/engines/rr_trx/t/rr_u_4.test
mysql-test/suite/innodb/r/innodb-autoinc-44030.result
mysql-test/suite/innodb/r/innodb-autoinc.result
mysql-test/suite/innodb/r/innodb-lock.result
mysql-test/suite/innodb/r/innodb-replace.result
mysql-test/suite/innodb/r/innodb-semi-consistent.result
mysql-test/suite/innodb/r/innodb-use-sys-malloc.result
mysql-test/suite/innodb/r/innodb_bug21704.result
mysql-test/suite/innodb/r/innodb_bug34053.result
mysql-test/suite/innodb/r/innodb_bug35220.result
mysql-test/suite/innodb/r/innodb_bug38231.result
mysql-test/suite/innodb/r/innodb_bug40565.result
mysql-test/suite/innodb/r/innodb_bug42101-nonzero.result
mysql-test/suite/innodb/r/innodb_bug42101.result
mysql-test/suite/innodb/r/innodb_bug44369.result
mysql-test/suite/innodb/r/innodb_bug45357.result
mysql-test/suite/innodb/r/innodb_bug46000.result
mysql-test/suite/innodb/r/innodb_bug47621.result
mysql-test/suite/innodb/r/innodb_bug47777.result
mysql-test/suite/innodb/r/innodb_bug51920.result
mysql-test/suite/innodb/r/innodb_bug52663.result
mysql-test/suite/innodb/r/innodb_misc1.result
mysql-test/suite/innodb/r/innodb_trx_weight.result
mysql-test/suite/innodb/t/disabled.def
mysql-test/suite/innodb/t/innodb-autoinc-44030.test
mysql-test/suite/innodb/t/innodb-autoinc.test
mysql-test/suite/innodb/t/innodb-lock.test
mysql-test/suite/innodb/t/innodb-master.opt
mysql-test/suite/innodb/t/innodb-replace.test
mysql-test/suite/innodb/t/innodb-semi-consistent-master.opt
mysql-test/suite/innodb/t/innodb-semi-consistent.test
mysql-test/suite/innodb/t/innodb_bug21704.test
mysql-test/suite/innodb/t/innodb_bug34053.test
mysql-test/suite/innodb/t/innodb_bug35220.test
mysql-test/suite/innodb/t/innodb_bug38231.test
mysql-test/suite/innodb/t/innodb_bug40565.test
mysql-test/suite/innodb/t/innodb_bug42101-nonzero-master.opt
mysql-test/suite/innodb/t/innodb_bug42101-nonzero.test
mysql-test/suite/innodb/t/innodb_bug42101.test
mysql-test/suite/innodb/t/innodb_bug44369.test
mysql-test/suite/innodb/t/innodb_bug45357.test
mysql-test/suite/innodb/t/innodb_bug46000.test
mysql-test/suite/innodb/t/innodb_bug47621.test
mysql-test/suite/innodb/t/innodb_bug47777.test
mysql-test/suite/innodb/t/innodb_bug51920.test
mysql-test/suite/innodb/t/innodb_bug52663-master.opt
mysql-test/suite/innodb/t/innodb_bug52663.test
mysql-test/suite/innodb/t/innodb_misc1-master.opt
mysql-test/suite/innodb/t/innodb_misc1.test
mysql-test/suite/innodb/t/innodb_trx_weight.test
mysql-test/suite/innodb_plugin/
mysql-test/suite/innodb_plugin/include/
mysql-test/suite/innodb_plugin/include/ctype_innodb_like.inc
mysql-test/suite/innodb_plugin/include/innodb-index.inc
mysql-test/suite/innodb_plugin/include/innodb_trx_weight.inc
mysql-test/suite/innodb_plugin/r/
mysql-test/suite/innodb_plugin/r/innodb-analyze.result
mysql-test/suite/innodb_plugin/r/innodb-autoinc-44030.result
mysql-test/suite/innodb_plugin/r/innodb-autoinc.result
mysql-test/suite/innodb_plugin/r/innodb-consistent.result
mysql-test/suite/innodb_plugin/r/innodb-index.result
mysql-test/suite/innodb_plugin/r/innodb-index_ucs2.result
mysql-test/suite/innodb_plugin/r/innodb-lock.result
mysql-test/suite/innodb_plugin/r/innodb-replace.result
mysql-test/suite/innodb_plugin/r/innodb-semi-consistent.result
mysql-test/suite/innodb_plugin/r/innodb-timeout.result
mysql-test/suite/innodb_plugin/r/innodb-use-sys-malloc.result
mysql-test/suite/innodb_plugin/r/innodb-zip.result
mysql-test/suite/innodb_plugin/r/innodb.result
mysql-test/suite/innodb_plugin/r/innodb_bug21704.result
mysql-test/suite/innodb_plugin/r/innodb_bug34053.result
mysql-test/suite/innodb_plugin/r/innodb_bug34300.result
mysql-test/suite/innodb_plugin/r/innodb_bug35220.result
mysql-test/suite/innodb_plugin/r/innodb_bug36169.result
mysql-test/suite/innodb_plugin/r/innodb_bug36172.result
mysql-test/suite/innodb_plugin/r/innodb_bug38231.result
mysql-test/suite/innodb_plugin/r/innodb_bug39438.result
mysql-test/suite/innodb_plugin/r/innodb_bug40360.result
mysql-test/suite/innodb_plugin/r/innodb_bug40565.result
mysql-test/suite/innodb_plugin/r/innodb_bug41904.result
mysql-test/suite/innodb_plugin/r/innodb_bug42101-nonzero.result
mysql-test/suite/innodb_plugin/r/innodb_bug42101.result
mysql-test/suite/innodb_plugin/r/innodb_bug44032.result
mysql-test/suite/innodb_plugin/r/innodb_bug44369.result
mysql-test/suite/innodb_plugin/r/innodb_bug44571.result
mysql-test/suite/innodb_plugin/r/innodb_bug45357.result
mysql-test/suite/innodb_plugin/r/innodb_bug46000.result
mysql-test/suite/innodb_plugin/r/innodb_bug46676.result
mysql-test/suite/innodb_plugin/r/innodb_bug47167.result
mysql-test/suite/innodb_plugin/r/innodb_bug47621.result
mysql-test/suite/innodb_plugin/r/innodb_bug47622.result
mysql-test/suite/innodb_plugin/r/innodb_bug47777.result
mysql-test/suite/innodb_plugin/r/innodb_bug51378.result
mysql-test/suite/innodb_plugin/r/innodb_bug51920.result
mysql-test/suite/innodb_plugin/r/innodb_bug52663.result
mysql-test/suite/innodb_plugin/r/innodb_bug52745.result
mysql-test/suite/innodb_plugin/r/innodb_file_format.result
mysql-test/suite/innodb_plugin/r/innodb_information_schema.result
mysql-test/suite/innodb_plugin/r/innodb_trx_weight.result
mysql-test/suite/innodb_plugin/t/
mysql-test/suite/innodb_plugin/t/innodb-analyze.test
mysql-test/suite/innodb_plugin/t/innodb-autoinc-44030.test
mysql-test/suite/innodb_plugin/t/innodb-autoinc.test
mysql-test/suite/innodb_plugin/t/innodb-consistent-master.opt
mysql-test/suite/innodb_plugin/t/innodb-consistent.test
mysql-test/suite/innodb_plugin/t/innodb-index.test
mysql-test/suite/innodb_plugin/t/innodb-index_ucs2.test
mysql-test/suite/innodb_plugin/t/innodb-lock.test
mysql-test/suite/innodb_plugin/t/innodb-master.opt
mysql-test/suite/innodb_plugin/t/innodb-replace.test
mysql-test/suite/innodb_plugin/t/innodb-semi-consistent-master.opt
mysql-test/suite/innodb_plugin/t/innodb-semi-consistent.test
mysql-test/suite/innodb_plugin/t/innodb-timeout.test
mysql-test/suite/innodb_plugin/t/innodb-use-sys-malloc-master.opt
mysql-test/suite/innodb_plugin/t/innodb-use-sys-malloc.test
mysql-test/suite/innodb_plugin/t/innodb-zip.test
mysql-test/suite/innodb_plugin/t/innodb.test
mysql-test/suite/innodb_plugin/t/innodb_bug21704.test
mysql-test/suite/innodb_plugin/t/innodb_bug34053.test
mysql-test/suite/innodb_plugin/t/innodb_bug34300.test
mysql-test/suite/innodb_plugin/t/innodb_bug35220.test
mysql-test/suite/innodb_plugin/t/innodb_bug36169.test
mysql-test/suite/innodb_plugin/t/innodb_bug36172.test
mysql-test/suite/innodb_plugin/t/innodb_bug38231.test
mysql-test/suite/innodb_plugin/t/innodb_bug39438-master.opt
mysql-test/suite/innodb_plugin/t/innodb_bug39438.test
mysql-test/suite/innodb_plugin/t/innodb_bug40360.test
mysql-test/suite/innodb_plugin/t/innodb_bug40565.test
mysql-test/suite/innodb_plugin/t/innodb_bug41904.test
mysql-test/suite/innodb_plugin/t/innodb_bug42101-nonzero-master.opt
mysql-test/suite/innodb_plugin/t/innodb_bug42101-nonzero.test
mysql-test/suite/innodb_plugin/t/innodb_bug42101.test
mysql-test/suite/innodb_plugin/t/innodb_bug44032.test
mysql-test/suite/innodb_plugin/t/innodb_bug44369.test
mysql-test/suite/innodb_plugin/t/innodb_bug44571.test
mysql-test/suite/innodb_plugin/t/innodb_bug45357.test
mysql-test/suite/innodb_plugin/t/innodb_bug46000.test
mysql-test/suite/innodb_plugin/t/innodb_bug46676.test
mysql-test/suite/innodb_plugin/t/innodb_bug47167.test
mysql-test/suite/innodb_plugin/t/innodb_bug47621.test
mysql-test/suite/innodb_plugin/t/innodb_bug47622.test
mysql-test/suite/innodb_plugin/t/innodb_bug47777.test
mysql-test/suite/innodb_plugin/t/innodb_bug51378.test
mysql-test/suite/innodb_plugin/t/innodb_bug51920.test
mysql-test/suite/innodb_plugin/t/innodb_bug52663.test
mysql-test/suite/innodb_plugin/t/innodb_bug52745.test
mysql-test/suite/innodb_plugin/t/innodb_file_format.test
mysql-test/suite/innodb_plugin/t/innodb_information_schema.test
mysql-test/suite/innodb_plugin/t/innodb_trx_weight.test
mysql-test/suite/rpl/r/rpl_show_slave_running.result
mysql-test/suite/rpl/r/rpl_slow_query_log.result
mysql-test/suite/rpl/r/rpl_stm_sql_mode.result
mysql-test/suite/rpl/r/rpl_typeconv_innodb.result
mysql-test/suite/rpl/t/rpl_begin_commit_rollback-master.opt
mysql-test/suite/rpl/t/rpl_show_slave_running.test
mysql-test/suite/rpl/t/rpl_slow_query_log-slave.opt
mysql-test/suite/rpl/t/rpl_slow_query_log.test
mysql-test/suite/rpl/t/rpl_stm_sql_mode.test
mysql-test/suite/rpl/t/rpl_typeconv-slave.opt
mysql-test/suite/rpl/t/rpl_typeconv_innodb.test
mysql-test/suite/sys_vars/r/secure_file_priv.result
mysql-test/suite/sys_vars/t/secure_file_priv-master.opt
mysql-test/suite/sys_vars/t/secure_file_priv.test
mysql-test/t/bug39022.test
mysql-test/t/bug46261-master.opt
mysql-test/t/bug46261.test
mysql-test/t/log_tables_upgrade.test
mysql-test/t/no_binlog.test
mysql-test/t/partition_debug_sync.test
mysql-test/t/plugin_not_embedded-master.opt
mysql-test/t/plugin_not_embedded.test
mysql-test/t/view_alias.test
storage/innobase/
storage/innobase/CMakeLists.txt
storage/innobase/Makefile.am
storage/innobase/btr/
storage/innobase/btr/btr0btr.c
storage/innobase/btr/btr0cur.c
storage/innobase/btr/btr0pcur.c
storage/innobase/btr/btr0sea.c
storage/innobase/buf/
storage/innobase/buf/buf0buf.c
storage/innobase/buf/buf0flu.c
storage/innobase/buf/buf0lru.c
storage/innobase/buf/buf0rea.c
storage/innobase/data/
storage/innobase/data/data0data.c
storage/innobase/data/data0type.c
storage/innobase/dict/
storage/innobase/dict/dict0boot.c
storage/innobase/dict/dict0crea.c
storage/innobase/dict/dict0dict.c
storage/innobase/dict/dict0load.c
storage/innobase/dict/dict0mem.c
storage/innobase/dyn/
storage/innobase/dyn/dyn0dyn.c
storage/innobase/eval/
storage/innobase/eval/eval0eval.c
storage/innobase/eval/eval0proc.c
storage/innobase/fil/
storage/innobase/fil/fil0fil.c
storage/innobase/fsp/
storage/innobase/fsp/fsp0fsp.c
storage/innobase/fut/
storage/innobase/fut/fut0fut.c
storage/innobase/fut/fut0lst.c
storage/innobase/ha/
storage/innobase/ha/ha0ha.c
storage/innobase/ha/hash0hash.c
storage/innobase/handler/
storage/innobase/handler/ha_innodb.cc
storage/innobase/handler/ha_innodb.h
storage/innobase/ibuf/
storage/innobase/ibuf/ibuf0ibuf.c
storage/innobase/include/
storage/innobase/include/btr0btr.h
storage/innobase/include/btr0btr.ic
storage/innobase/include/btr0cur.h
storage/innobase/include/btr0cur.ic
storage/innobase/include/btr0pcur.h
storage/innobase/include/btr0pcur.ic
storage/innobase/include/btr0sea.h
storage/innobase/include/btr0sea.ic
storage/innobase/include/btr0types.h
storage/innobase/include/buf0buf.h
storage/innobase/include/buf0buf.ic
storage/innobase/include/buf0flu.h
storage/innobase/include/buf0flu.ic
storage/innobase/include/buf0lru.h
storage/innobase/include/buf0lru.ic
storage/innobase/include/buf0rea.h
storage/innobase/include/buf0types.h
storage/innobase/include/data0data.h
storage/innobase/include/data0data.ic
storage/innobase/include/data0type.h
storage/innobase/include/data0type.ic
storage/innobase/include/data0types.h
storage/innobase/include/db0err.h
storage/innobase/include/dict0boot.h
storage/innobase/include/dict0boot.ic
storage/innobase/include/dict0crea.h
storage/innobase/include/dict0crea.ic
storage/innobase/include/dict0dict.h
storage/innobase/include/dict0dict.ic
storage/innobase/include/dict0load.h
storage/innobase/include/dict0load.ic
storage/innobase/include/dict0mem.h
storage/innobase/include/dict0mem.ic
storage/innobase/include/dict0types.h
storage/innobase/include/dyn0dyn.h
storage/innobase/include/dyn0dyn.ic
storage/innobase/include/eval0eval.h
storage/innobase/include/eval0eval.ic
storage/innobase/include/eval0proc.h
storage/innobase/include/eval0proc.ic
storage/innobase/include/fil0fil.h
storage/innobase/include/fsp0fsp.h
storage/innobase/include/fsp0fsp.ic
storage/innobase/include/fsp0types.h
storage/innobase/include/fut0fut.h
storage/innobase/include/fut0fut.ic
storage/innobase/include/fut0lst.h
storage/innobase/include/fut0lst.ic
storage/innobase/include/ha0ha.h
storage/innobase/include/ha0ha.ic
storage/innobase/include/ha_prototypes.h
storage/innobase/include/hash0hash.h
storage/innobase/include/hash0hash.ic
storage/innobase/include/ibuf0ibuf.h
storage/innobase/include/ibuf0ibuf.ic
storage/innobase/include/ibuf0types.h
storage/innobase/include/lock0iter.h
storage/innobase/include/lock0lock.h
storage/innobase/include/lock0lock.ic
storage/innobase/include/lock0priv.h
storage/innobase/include/lock0priv.ic
storage/innobase/include/lock0types.h
storage/innobase/include/log0log.h
storage/innobase/include/log0log.ic
storage/innobase/include/log0recv.h
storage/innobase/include/log0recv.ic
storage/innobase/include/mach0data.h
storage/innobase/include/mach0data.ic
storage/innobase/include/mem0dbg.h
storage/innobase/include/mem0dbg.ic
storage/innobase/include/mem0mem.h
storage/innobase/include/mem0mem.ic
storage/innobase/include/mem0pool.h
storage/innobase/include/mem0pool.ic
storage/innobase/include/mtr0log.h
storage/innobase/include/mtr0log.ic
storage/innobase/include/mtr0mtr.h
storage/innobase/include/mtr0mtr.ic
storage/innobase/include/mtr0types.h
storage/innobase/include/os0file.h
storage/innobase/include/os0proc.h
storage/innobase/include/os0proc.ic
storage/innobase/include/os0sync.h
storage/innobase/include/os0sync.ic
storage/innobase/include/os0thread.h
storage/innobase/include/os0thread.ic
storage/innobase/include/page0cur.h
storage/innobase/include/page0cur.ic
storage/innobase/include/page0page.h
storage/innobase/include/page0page.ic
storage/innobase/include/page0types.h
storage/innobase/include/pars0grm.h
storage/innobase/include/pars0opt.h
storage/innobase/include/pars0opt.ic
storage/innobase/include/pars0pars.h
storage/innobase/include/pars0pars.ic
storage/innobase/include/pars0sym.h
storage/innobase/include/pars0sym.ic
storage/innobase/include/pars0types.h
storage/innobase/include/que0que.h
storage/innobase/include/que0que.ic
storage/innobase/include/que0types.h
storage/innobase/include/read0read.h
storage/innobase/include/read0read.ic
storage/innobase/include/read0types.h
storage/innobase/include/rem0cmp.h
storage/innobase/include/rem0cmp.ic
storage/innobase/include/rem0rec.h
storage/innobase/include/rem0rec.ic
storage/innobase/include/rem0types.h
storage/innobase/include/row0ins.h
storage/innobase/include/row0ins.ic
storage/innobase/include/row0mysql.h
storage/innobase/include/row0mysql.ic
storage/innobase/include/row0purge.h
storage/innobase/include/row0purge.ic
storage/innobase/include/row0row.h
storage/innobase/include/row0row.ic
storage/innobase/include/row0sel.h
storage/innobase/include/row0sel.ic
storage/innobase/include/row0types.h
storage/innobase/include/row0uins.h
storage/innobase/include/row0uins.ic
storage/innobase/include/row0umod.h
storage/innobase/include/row0umod.ic
storage/innobase/include/row0undo.h
storage/innobase/include/row0undo.ic
storage/innobase/include/row0upd.h
storage/innobase/include/row0upd.ic
storage/innobase/include/row0vers.h
storage/innobase/include/row0vers.ic
storage/innobase/include/srv0que.h
storage/innobase/include/srv0srv.h
storage/innobase/include/srv0srv.ic
storage/innobase/include/srv0start.h
storage/innobase/include/sync0arr.h
storage/innobase/include/sync0arr.ic
storage/innobase/include/sync0rw.h
storage/innobase/include/sync0rw.ic
storage/innobase/include/sync0sync.h
storage/innobase/include/sync0sync.ic
storage/innobase/include/sync0types.h
storage/innobase/include/thr0loc.h
storage/innobase/include/thr0loc.ic
storage/innobase/include/trx0purge.h
storage/innobase/include/trx0purge.ic
storage/innobase/include/trx0rec.h
storage/innobase/include/trx0rec.ic
storage/innobase/include/trx0roll.h
storage/innobase/include/trx0roll.ic
storage/innobase/include/trx0rseg.h
storage/innobase/include/trx0rseg.ic
storage/innobase/include/trx0sys.h
storage/innobase/include/trx0sys.ic
storage/innobase/include/trx0trx.h
storage/innobase/include/trx0trx.ic
storage/innobase/include/trx0types.h
storage/innobase/include/trx0undo.h
storage/innobase/include/trx0undo.ic
storage/innobase/include/trx0xa.h
storage/innobase/include/univ.i
storage/innobase/include/usr0sess.h
storage/innobase/include/usr0sess.ic
storage/innobase/include/usr0types.h
storage/innobase/include/ut0byte.h
storage/innobase/include/ut0byte.ic
storage/innobase/include/ut0dbg.h
storage/innobase/include/ut0list.h
storage/innobase/include/ut0list.ic
storage/innobase/include/ut0lst.h
storage/innobase/include/ut0mem.h
storage/innobase/include/ut0mem.ic
storage/innobase/include/ut0rnd.h
storage/innobase/include/ut0rnd.ic
storage/innobase/include/ut0sort.h
storage/innobase/include/ut0ut.h
storage/innobase/include/ut0ut.ic
storage/innobase/include/ut0vec.h
storage/innobase/include/ut0vec.ic
storage/innobase/include/ut0wqueue.h
storage/innobase/lock/
storage/innobase/lock/lock0iter.c
storage/innobase/lock/lock0lock.c
storage/innobase/log/
storage/innobase/log/log0log.c
storage/innobase/log/log0recv.c
storage/innobase/mach/
storage/innobase/mach/mach0data.c
storage/innobase/mem/
storage/innobase/mem/mem0dbg.c
storage/innobase/mem/mem0mem.c
storage/innobase/mem/mem0pool.c
storage/innobase/mtr/
storage/innobase/mtr/mtr0log.c
storage/innobase/mtr/mtr0mtr.c
storage/innobase/mysql-test/
storage/innobase/os/
storage/innobase/os/os0file.c
storage/innobase/os/os0proc.c
storage/innobase/os/os0sync.c
storage/innobase/os/os0thread.c
storage/innobase/page/
storage/innobase/page/page0cur.c
storage/innobase/page/page0page.c
storage/innobase/pars/
storage/innobase/pars/lexyy.c
storage/innobase/pars/make_bison.sh
storage/innobase/pars/make_flex.sh
storage/innobase/pars/pars0grm.c
storage/innobase/pars/pars0grm.h
storage/innobase/pars/pars0grm.y
storage/innobase/pars/pars0lex.l
storage/innobase/pars/pars0opt.c
storage/innobase/pars/pars0pars.c
storage/innobase/pars/pars0sym.c
storage/innobase/plug.in.disabled
storage/innobase/que/
storage/innobase/que/que0que.c
storage/innobase/read/
storage/innobase/read/read0read.c
storage/innobase/rem/
storage/innobase/rem/rem0cmp.c
storage/innobase/rem/rem0rec.c
storage/innobase/row/
storage/innobase/row/row0ins.c
storage/innobase/row/row0mysql.c
storage/innobase/row/row0purge.c
storage/innobase/row/row0row.c
storage/innobase/row/row0sel.c
storage/innobase/row/row0uins.c
storage/innobase/row/row0umod.c
storage/innobase/row/row0undo.c
storage/innobase/row/row0upd.c
storage/innobase/row/row0vers.c
storage/innobase/srv/
storage/innobase/srv/srv0que.c
storage/innobase/srv/srv0srv.c
storage/innobase/srv/srv0start.c
storage/innobase/sync/
storage/innobase/sync/sync0arr.c
storage/innobase/sync/sync0rw.c
storage/innobase/sync/sync0sync.c
storage/innobase/thr/
storage/innobase/thr/thr0loc.c
storage/innobase/trx/
storage/innobase/trx/trx0purge.c
storage/innobase/trx/trx0rec.c
storage/innobase/trx/trx0roll.c
storage/innobase/trx/trx0rseg.c
storage/innobase/trx/trx0sys.c
storage/innobase/trx/trx0trx.c
storage/innobase/trx/trx0undo.c
storage/innobase/usr/
storage/innobase/usr/usr0sess.c
storage/innobase/ut/
storage/innobase/ut/ut0byte.c
storage/innobase/ut/ut0dbg.c
storage/innobase/ut/ut0list.c
storage/innobase/ut/ut0mem.c
storage/innobase/ut/ut0rnd.c
storage/innobase/ut/ut0ut.c
storage/innobase/ut/ut0vec.c
storage/innobase/ut/ut0wqueue.c
storage/innodb_plugin/
storage/innodb_plugin/CMakeLists.txt
storage/innodb_plugin/COPYING
storage/innodb_plugin/COPYING.Google
storage/innodb_plugin/COPYING.Percona
storage/innodb_plugin/COPYING.Sun_Microsystems
storage/innodb_plugin/ChangeLog
storage/innodb_plugin/Doxyfile
storage/innodb_plugin/Makefile.am
storage/innodb_plugin/btr/
storage/innodb_plugin/btr/btr0btr.c
storage/innodb_plugin/btr/btr0cur.c
storage/innodb_plugin/btr/btr0pcur.c
storage/innodb_plugin/btr/btr0sea.c
storage/innodb_plugin/buf/
storage/innodb_plugin/buf/buf0buddy.c
storage/innodb_plugin/buf/buf0buf.c
storage/innodb_plugin/buf/buf0flu.c
storage/innodb_plugin/buf/buf0lru.c
storage/innodb_plugin/buf/buf0rea.c
storage/innodb_plugin/compile-innodb
storage/innodb_plugin/compile-innodb-debug
storage/innodb_plugin/data/
storage/innodb_plugin/data/data0data.c
storage/innodb_plugin/data/data0type.c
storage/innodb_plugin/dict/
storage/innodb_plugin/dict/dict0boot.c
storage/innodb_plugin/dict/dict0crea.c
storage/innodb_plugin/dict/dict0dict.c
storage/innodb_plugin/dict/dict0load.c
storage/innodb_plugin/dict/dict0mem.c
storage/innodb_plugin/dyn/
storage/innodb_plugin/dyn/dyn0dyn.c
storage/innodb_plugin/eval/
storage/innodb_plugin/eval/eval0eval.c
storage/innodb_plugin/eval/eval0proc.c
storage/innodb_plugin/fil/
storage/innodb_plugin/fil/fil0fil.c
storage/innodb_plugin/fsp/
storage/innodb_plugin/fsp/fsp0fsp.c
storage/innodb_plugin/fut/
storage/innodb_plugin/fut/fut0fut.c
storage/innodb_plugin/fut/fut0lst.c
storage/innodb_plugin/ha/
storage/innodb_plugin/ha/ha0ha.c
storage/innodb_plugin/ha/ha0storage.c
storage/innodb_plugin/ha/hash0hash.c
storage/innodb_plugin/ha_innodb.def
storage/innodb_plugin/handler/
storage/innodb_plugin/handler/ha_innodb.cc
storage/innodb_plugin/handler/ha_innodb.h
storage/innodb_plugin/handler/handler0alter.cc
storage/innodb_plugin/handler/i_s.cc
storage/innodb_plugin/handler/i_s.h
storage/innodb_plugin/handler/mysql_addons.cc
storage/innodb_plugin/ibuf/
storage/innodb_plugin/ibuf/ibuf0ibuf.c
storage/innodb_plugin/include/
storage/innodb_plugin/include/btr0btr.h
storage/innodb_plugin/include/btr0btr.ic
storage/innodb_plugin/include/btr0cur.h
storage/innodb_plugin/include/btr0cur.ic
storage/innodb_plugin/include/btr0pcur.h
storage/innodb_plugin/include/btr0pcur.ic
storage/innodb_plugin/include/btr0sea.h
storage/innodb_plugin/include/btr0sea.ic
storage/innodb_plugin/include/btr0types.h
storage/innodb_plugin/include/buf0buddy.h
storage/innodb_plugin/include/buf0buddy.ic
storage/innodb_plugin/include/buf0buf.h
storage/innodb_plugin/include/buf0buf.ic
storage/innodb_plugin/include/buf0flu.h
storage/innodb_plugin/include/buf0flu.ic
storage/innodb_plugin/include/buf0lru.h
storage/innodb_plugin/include/buf0lru.ic
storage/innodb_plugin/include/buf0rea.h
storage/innodb_plugin/include/buf0types.h
storage/innodb_plugin/include/data0data.h
storage/innodb_plugin/include/data0data.ic
storage/innodb_plugin/include/data0type.h
storage/innodb_plugin/include/data0type.ic
storage/innodb_plugin/include/data0types.h
storage/innodb_plugin/include/db0err.h
storage/innodb_plugin/include/dict0boot.h
storage/innodb_plugin/include/dict0boot.ic
storage/innodb_plugin/include/dict0crea.h
storage/innodb_plugin/include/dict0crea.ic
storage/innodb_plugin/include/dict0dict.h
storage/innodb_plugin/include/dict0dict.ic
storage/innodb_plugin/include/dict0load.h
storage/innodb_plugin/include/dict0load.ic
storage/innodb_plugin/include/dict0mem.h
storage/innodb_plugin/include/dict0mem.ic
storage/innodb_plugin/include/dict0types.h
storage/innodb_plugin/include/dyn0dyn.h
storage/innodb_plugin/include/dyn0dyn.ic
storage/innodb_plugin/include/eval0eval.h
storage/innodb_plugin/include/eval0eval.ic
storage/innodb_plugin/include/eval0proc.h
storage/innodb_plugin/include/eval0proc.ic
storage/innodb_plugin/include/fil0fil.h
storage/innodb_plugin/include/fsp0fsp.h
storage/innodb_plugin/include/fsp0fsp.ic
storage/innodb_plugin/include/fsp0types.h
storage/innodb_plugin/include/fut0fut.h
storage/innodb_plugin/include/fut0fut.ic
storage/innodb_plugin/include/fut0lst.h
storage/innodb_plugin/include/fut0lst.ic
storage/innodb_plugin/include/ha0ha.h
storage/innodb_plugin/include/ha0ha.ic
storage/innodb_plugin/include/ha0storage.h
storage/innodb_plugin/include/ha0storage.ic
storage/innodb_plugin/include/ha_prototypes.h
storage/innodb_plugin/include/handler0alter.h
storage/innodb_plugin/include/hash0hash.h
storage/innodb_plugin/include/hash0hash.ic
storage/innodb_plugin/include/ibuf0ibuf.h
storage/innodb_plugin/include/ibuf0ibuf.ic
storage/innodb_plugin/include/ibuf0types.h
storage/innodb_plugin/include/lock0iter.h
storage/innodb_plugin/include/lock0lock.h
storage/innodb_plugin/include/lock0lock.ic
storage/innodb_plugin/include/lock0priv.h
storage/innodb_plugin/include/lock0priv.ic
storage/innodb_plugin/include/lock0types.h
storage/innodb_plugin/include/log0log.h
storage/innodb_plugin/include/log0log.ic
storage/innodb_plugin/include/log0recv.h
storage/innodb_plugin/include/log0recv.ic
storage/innodb_plugin/include/mach0data.h
storage/innodb_plugin/include/mach0data.ic
storage/innodb_plugin/include/mem0dbg.h
storage/innodb_plugin/include/mem0dbg.ic
storage/innodb_plugin/include/mem0mem.h
storage/innodb_plugin/include/mem0mem.ic
storage/innodb_plugin/include/mem0pool.h
storage/innodb_plugin/include/mem0pool.ic
storage/innodb_plugin/include/mtr0log.h
storage/innodb_plugin/include/mtr0log.ic
storage/innodb_plugin/include/mtr0mtr.h
storage/innodb_plugin/include/mtr0mtr.ic
storage/innodb_plugin/include/mtr0types.h
storage/innodb_plugin/include/mysql_addons.h
storage/innodb_plugin/include/os0file.h
storage/innodb_plugin/include/os0proc.h
storage/innodb_plugin/include/os0proc.ic
storage/innodb_plugin/include/os0sync.h
storage/innodb_plugin/include/os0sync.ic
storage/innodb_plugin/include/os0thread.h
storage/innodb_plugin/include/os0thread.ic
storage/innodb_plugin/include/page0cur.h
storage/innodb_plugin/include/page0cur.ic
storage/innodb_plugin/include/page0page.h
storage/innodb_plugin/include/page0page.ic
storage/innodb_plugin/include/page0types.h
storage/innodb_plugin/include/page0zip.h
storage/innodb_plugin/include/page0zip.ic
storage/innodb_plugin/include/pars0grm.h
storage/innodb_plugin/include/pars0opt.h
storage/innodb_plugin/include/pars0opt.ic
storage/innodb_plugin/include/pars0pars.h
storage/innodb_plugin/include/pars0pars.ic
storage/innodb_plugin/include/pars0sym.h
storage/innodb_plugin/include/pars0sym.ic
storage/innodb_plugin/include/pars0types.h
storage/innodb_plugin/include/que0que.h
storage/innodb_plugin/include/que0que.ic
storage/innodb_plugin/include/que0types.h
storage/innodb_plugin/include/read0read.h
storage/innodb_plugin/include/read0read.ic
storage/innodb_plugin/include/read0types.h
storage/innodb_plugin/include/rem0cmp.h
storage/innodb_plugin/include/rem0cmp.ic
storage/innodb_plugin/include/rem0rec.h
storage/innodb_plugin/include/rem0rec.ic
storage/innodb_plugin/include/rem0types.h
storage/innodb_plugin/include/row0ext.h
storage/innodb_plugin/include/row0ext.ic
storage/innodb_plugin/include/row0ins.h
storage/innodb_plugin/include/row0ins.ic
storage/innodb_plugin/include/row0merge.h
storage/innodb_plugin/include/row0mysql.h
storage/innodb_plugin/include/row0mysql.ic
storage/innodb_plugin/include/row0purge.h
storage/innodb_plugin/include/row0purge.ic
storage/innodb_plugin/include/row0row.h
storage/innodb_plugin/include/row0row.ic
storage/innodb_plugin/include/row0sel.h
storage/innodb_plugin/include/row0sel.ic
storage/innodb_plugin/include/row0types.h
storage/innodb_plugin/include/row0uins.h
storage/innodb_plugin/include/row0uins.ic
storage/innodb_plugin/include/row0umod.h
storage/innodb_plugin/include/row0umod.ic
storage/innodb_plugin/include/row0undo.h
storage/innodb_plugin/include/row0undo.ic
storage/innodb_plugin/include/row0upd.h
storage/innodb_plugin/include/row0upd.ic
storage/innodb_plugin/include/row0vers.h
storage/innodb_plugin/include/row0vers.ic
storage/innodb_plugin/include/srv0que.h
storage/innodb_plugin/include/srv0srv.h
storage/innodb_plugin/include/srv0srv.ic
storage/innodb_plugin/include/srv0start.h
storage/innodb_plugin/include/sync0arr.h
storage/innodb_plugin/include/sync0arr.ic
storage/innodb_plugin/include/sync0rw.h
storage/innodb_plugin/include/sync0rw.ic
storage/innodb_plugin/include/sync0sync.h
storage/innodb_plugin/include/sync0sync.ic
storage/innodb_plugin/include/sync0types.h
storage/innodb_plugin/include/thr0loc.h
storage/innodb_plugin/include/thr0loc.ic
storage/innodb_plugin/include/trx0i_s.h
storage/innodb_plugin/include/trx0purge.h
storage/innodb_plugin/include/trx0purge.ic
storage/innodb_plugin/include/trx0rec.h
storage/innodb_plugin/include/trx0rec.ic
storage/innodb_plugin/include/trx0roll.h
storage/innodb_plugin/include/trx0roll.ic
storage/innodb_plugin/include/trx0rseg.h
storage/innodb_plugin/include/trx0rseg.ic
storage/innodb_plugin/include/trx0sys.h
storage/innodb_plugin/include/trx0sys.ic
storage/innodb_plugin/include/trx0trx.h
storage/innodb_plugin/include/trx0trx.ic
storage/innodb_plugin/include/trx0types.h
storage/innodb_plugin/include/trx0undo.h
storage/innodb_plugin/include/trx0undo.ic
storage/innodb_plugin/include/trx0xa.h
storage/innodb_plugin/include/univ.i
storage/innodb_plugin/include/usr0sess.h
storage/innodb_plugin/include/usr0sess.ic
storage/innodb_plugin/include/usr0types.h
storage/innodb_plugin/include/ut0auxconf.h
storage/innodb_plugin/include/ut0byte.h
storage/innodb_plugin/include/ut0byte.ic
storage/innodb_plugin/include/ut0dbg.h
storage/innodb_plugin/include/ut0list.h
storage/innodb_plugin/include/ut0list.ic
storage/innodb_plugin/include/ut0lst.h
storage/innodb_plugin/include/ut0mem.h
storage/innodb_plugin/include/ut0mem.ic
storage/innodb_plugin/include/ut0rbt.h
storage/innodb_plugin/include/ut0rnd.h
storage/innodb_plugin/include/ut0rnd.ic
storage/innodb_plugin/include/ut0sort.h
storage/innodb_plugin/include/ut0ut.h
storage/innodb_plugin/include/ut0ut.ic
storage/innodb_plugin/include/ut0vec.h
storage/innodb_plugin/include/ut0vec.ic
storage/innodb_plugin/include/ut0wqueue.h
storage/innodb_plugin/lock/
storage/innodb_plugin/lock/lock0iter.c
storage/innodb_plugin/lock/lock0lock.c
storage/innodb_plugin/log/
storage/innodb_plugin/log/log0log.c
storage/innodb_plugin/log/log0recv.c
storage/innodb_plugin/mach/
storage/innodb_plugin/mach/mach0data.c
storage/innodb_plugin/mem/
storage/innodb_plugin/mem/mem0dbg.c
storage/innodb_plugin/mem/mem0mem.c
storage/innodb_plugin/mem/mem0pool.c
storage/innodb_plugin/mtr/
storage/innodb_plugin/mtr/mtr0log.c
storage/innodb_plugin/mtr/mtr0mtr.c
storage/innodb_plugin/mysql-test/
storage/innodb_plugin/mysql-test/patches/
storage/innodb_plugin/mysql-test/patches/README
storage/innodb_plugin/mysql-test/patches/index_merge_innodb-explain.diff
storage/innodb_plugin/mysql-test/patches/information_schema.diff
storage/innodb_plugin/mysql-test/patches/innodb_file_per_table.diff
storage/innodb_plugin/mysql-test/patches/innodb_lock_wait_timeout.diff
storage/innodb_plugin/mysql-test/patches/innodb_thread_concurrency_basic.diff
storage/innodb_plugin/mysql-test/patches/partition_innodb.diff
storage/innodb_plugin/os/
storage/innodb_plugin/os/os0file.c
storage/innodb_plugin/os/os0proc.c
storage/innodb_plugin/os/os0sync.c
storage/innodb_plugin/os/os0thread.c
storage/innodb_plugin/page/
storage/innodb_plugin/page/page0cur.c
storage/innodb_plugin/page/page0page.c
storage/innodb_plugin/page/page0zip.c
storage/innodb_plugin/pars/
storage/innodb_plugin/pars/lexyy.c
storage/innodb_plugin/pars/make_bison.sh
storage/innodb_plugin/pars/make_flex.sh
storage/innodb_plugin/pars/pars0grm.c
storage/innodb_plugin/pars/pars0grm.y
storage/innodb_plugin/pars/pars0lex.l
storage/innodb_plugin/pars/pars0opt.c
storage/innodb_plugin/pars/pars0pars.c
storage/innodb_plugin/pars/pars0sym.c
storage/innodb_plugin/plug.in.disabled
storage/innodb_plugin/que/
storage/innodb_plugin/que/que0que.c
storage/innodb_plugin/read/
storage/innodb_plugin/read/read0read.c
storage/innodb_plugin/rem/
storage/innodb_plugin/rem/rem0cmp.c
storage/innodb_plugin/rem/rem0rec.c
storage/innodb_plugin/revert_gen.sh
storage/innodb_plugin/row/
storage/innodb_plugin/row/row0ext.c
storage/innodb_plugin/row/row0ins.c
storage/innodb_plugin/row/row0merge.c
storage/innodb_plugin/row/row0mysql.c
storage/innodb_plugin/row/row0purge.c
storage/innodb_plugin/row/row0row.c
storage/innodb_plugin/row/row0sel.c
storage/innodb_plugin/row/row0uins.c
storage/innodb_plugin/row/row0umod.c
storage/innodb_plugin/row/row0undo.c
storage/innodb_plugin/row/row0upd.c
storage/innodb_plugin/row/row0vers.c
storage/innodb_plugin/scripts/
storage/innodb_plugin/scripts/export.sh
storage/innodb_plugin/scripts/install_innodb_plugins.sql
storage/innodb_plugin/scripts/install_innodb_plugins_win.sql
storage/innodb_plugin/setup.sh
storage/innodb_plugin/srv/
storage/innodb_plugin/srv/srv0que.c
storage/innodb_plugin/srv/srv0srv.c
storage/innodb_plugin/srv/srv0start.c
storage/innodb_plugin/sync/
storage/innodb_plugin/sync/sync0arr.c
storage/innodb_plugin/sync/sync0rw.c
storage/innodb_plugin/sync/sync0sync.c
storage/innodb_plugin/thr/
storage/innodb_plugin/thr/thr0loc.c
storage/innodb_plugin/trx/
storage/innodb_plugin/trx/trx0i_s.c
storage/innodb_plugin/trx/trx0purge.c
storage/innodb_plugin/trx/trx0rec.c
storage/innodb_plugin/trx/trx0roll.c
storage/innodb_plugin/trx/trx0rseg.c
storage/innodb_plugin/trx/trx0sys.c
storage/innodb_plugin/trx/trx0trx.c
storage/innodb_plugin/trx/trx0undo.c
storage/innodb_plugin/usr/
storage/innodb_plugin/usr/usr0sess.c
storage/innodb_plugin/ut/
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_gcc.c
storage/innodb_plugin/ut/ut0auxconf_atomic_pthread_t_solaris.c
storage/innodb_plugin/ut/ut0auxconf_have_gcc_atomics.c
storage/innodb_plugin/ut/ut0auxconf_have_solaris_atomics.c
storage/innodb_plugin/ut/ut0auxconf_pause.c
storage/innodb_plugin/ut/ut0auxconf_sizeof_pthread_t.c
storage/innodb_plugin/ut/ut0byte.c
storage/innodb_plugin/ut/ut0dbg.c
storage/innodb_plugin/ut/ut0list.c
storage/innodb_plugin/ut/ut0mem.c
storage/innodb_plugin/ut/ut0rbt.c
storage/innodb_plugin/ut/ut0rnd.c
storage/innodb_plugin/ut/ut0ut.c
storage/innodb_plugin/ut/ut0vec.c
storage/innodb_plugin/ut/ut0wqueue.c
storage/pbxt/bin/
storage/pbxt/bin/Makefile.am
storage/pbxt/bin/xtstat_xt.cc
storage/xtradb/build/
storage/xtradb/build/debian/
storage/xtradb/build/debian/README.Maintainer
storage/xtradb/build/debian/additions/
storage/xtradb/build/debian/additions/Docs__Images__Makefile.in
storage/xtradb/build/debian/additions/Docs__Makefile.in
storage/xtradb/build/debian/additions/debian-start
storage/xtradb/build/debian/additions/debian-start.inc.sh
storage/xtradb/build/debian/additions/echo_stderr
storage/xtradb/build/debian/additions/innotop/
storage/xtradb/build/debian/additions/innotop/InnoDBParser.pm
storage/xtradb/build/debian/additions/innotop/changelog.innotop
storage/xtradb/build/debian/additions/innotop/innotop
storage/xtradb/build/debian/additions/innotop/innotop.1
storage/xtradb/build/debian/additions/msql2mysql.1
storage/xtradb/build/debian/additions/my.cnf
storage/xtradb/build/debian/additions/my_print_defaults.1
storage/xtradb/build/debian/additions/myisam_ftdump.1
storage/xtradb/build/debian/additions/myisamchk.1
storage/xtradb/build/debian/additions/myisamlog.1
storage/xtradb/build/debian/additions/myisampack.1
storage/xtradb/build/debian/additions/mysql-server.lintian-overrides
storage/xtradb/build/debian/additions/mysql_config.1
storage/xtradb/build/debian/additions/mysql_convert_table_format.1
storage/xtradb/build/debian/additions/mysql_find_rows.1
storage/xtradb/build/debian/additions/mysql_fix_extensions.1
storage/xtradb/build/debian/additions/mysql_install_db.1
storage/xtradb/build/debian/additions/mysql_secure_installation.1
storage/xtradb/build/debian/additions/mysql_setpermission.1
storage/xtradb/build/debian/additions/mysql_tableinfo.1
storage/xtradb/build/debian/additions/mysql_waitpid.1
storage/xtradb/build/debian/additions/mysqlbinlog.1
storage/xtradb/build/debian/additions/mysqlbug.1
storage/xtradb/build/debian/additions/mysqlcheck.1
storage/xtradb/build/debian/additions/mysqld_safe_syslog.cnf
storage/xtradb/build/debian/additions/mysqldumpslow.1
storage/xtradb/build/debian/additions/mysqlimport.1
storage/xtradb/build/debian/additions/mysqlmanager.1
storage/xtradb/build/debian/additions/mysqlreport
storage/xtradb/build/debian/additions/mysqlreport.1
storage/xtradb/build/debian/additions/mysqltest.1
storage/xtradb/build/debian/additions/pack_isam.1
storage/xtradb/build/debian/additions/resolve_stack_dump.1
storage/xtradb/build/debian/additions/resolveip.1
storage/xtradb/build/debian/changelog
storage/xtradb/build/debian/compat
storage/xtradb/build/debian/control
storage/xtradb/build/debian/copyright
storage/xtradb/build/debian/libpercona-xtradb-client-dev.README.Maintainer
storage/xtradb/build/debian/libpercona-xtradb-client-dev.dirs
storage/xtradb/build/debian/libpercona-xtradb-client-dev.docs
storage/xtradb/build/debian/libpercona-xtradb-client-dev.examples
storage/xtradb/build/debian/libpercona-xtradb-client-dev.files
storage/xtradb/build/debian/libpercona-xtradb-client-dev.links
storage/xtradb/build/debian/libpercona-xtradb-client16.dirs
storage/xtradb/build/debian/libpercona-xtradb-client16.docs
storage/xtradb/build/debian/libpercona-xtradb-client16.files
storage/xtradb/build/debian/libpercona-xtradb-client16.postinst
storage/xtradb/build/debian/patches/
storage/xtradb/build/debian/patches/00list
storage/xtradb/build/debian/patches/01_MAKEFILES__Docs_Images_Makefile.in.dpatch
storage/xtradb/build/debian/patches/01_MAKEFILES__Docs_Makefile.in.dpatch
storage/xtradb/build/debian/patches/33_scripts__mysql_create_system_tables__no_test.dpatch
storage/xtradb/build/debian/patches/38_scripts__mysqld_safe.sh__signals.dpatch
storage/xtradb/build/debian/patches/41_scripts__mysql_install_db.sh__no_test.dpatch
storage/xtradb/build/debian/patches/44_scripts__mysql_config__libs.dpatch
storage/xtradb/build/debian/patches/50_mysql-test__db_test.dpatch
storage/xtradb/build/debian/patches/60_percona_support.dpatch
storage/xtradb/build/debian/percona-xtradb-client-5.1.README.Debian
storage/xtradb/build/debian/percona-xtradb-client-5.1.dirs
storage/xtradb/build/debian/percona-xtradb-client-5.1.docs
storage/xtradb/build/debian/percona-xtradb-client-5.1.files
storage/xtradb/build/debian/percona-xtradb-client-5.1.links
storage/xtradb/build/debian/percona-xtradb-client-5.1.lintian-overrides
storage/xtradb/build/debian/percona-xtradb-client-5.1.menu
storage/xtradb/build/debian/percona-xtradb-common.dirs
storage/xtradb/build/debian/percona-xtradb-common.files
storage/xtradb/build/debian/percona-xtradb-common.lintian-overrides
storage/xtradb/build/debian/percona-xtradb-common.postrm
storage/xtradb/build/debian/percona-xtradb-server-5.1.NEWS
storage/xtradb/build/debian/percona-xtradb-server-5.1.README.Debian
storage/xtradb/build/debian/percona-xtradb-server-5.1.config
storage/xtradb/build/debian/percona-xtradb-server-5.1.dirs
storage/xtradb/build/debian/percona-xtradb-server-5.1.docs
storage/xtradb/build/debian/percona-xtradb-server-5.1.files
storage/xtradb/build/debian/percona-xtradb-server-5.1.links
storage/xtradb/build/debian/percona-xtradb-server-5.1.lintian-overrides
storage/xtradb/build/debian/percona-xtradb-server-5.1.logcheck.ignore.paranoid
storage/xtradb/build/debian/percona-xtradb-server-5.1.logcheck.ignore.server
storage/xtradb/build/debian/percona-xtradb-server-5.1.logcheck.ignore.workstation
storage/xtradb/build/debian/percona-xtradb-server-5.1.mysql.init
storage/xtradb/build/debian/percona-xtradb-server-5.1.percona-xtradb-server.logrotate
storage/xtradb/build/debian/percona-xtradb-server-5.1.postinst
storage/xtradb/build/debian/percona-xtradb-server-5.1.postrm
storage/xtradb/build/debian/percona-xtradb-server-5.1.preinst
storage/xtradb/build/debian/percona-xtradb-server-5.1.prerm
storage/xtradb/build/debian/percona-xtradb-server-5.1.templates
storage/xtradb/build/debian/po/
storage/xtradb/build/debian/po/POTFILES.in
storage/xtradb/build/debian/po/ar.po
storage/xtradb/build/debian/po/ca.po
storage/xtradb/build/debian/po/cs.po
storage/xtradb/build/debian/po/da.po
storage/xtradb/build/debian/po/de.po
storage/xtradb/build/debian/po/es.po
storage/xtradb/build/debian/po/eu.po
storage/xtradb/build/debian/po/fr.po
storage/xtradb/build/debian/po/gl.po
storage/xtradb/build/debian/po/it.po
storage/xtradb/build/debian/po/ja.po
storage/xtradb/build/debian/po/nb.po
storage/xtradb/build/debian/po/nl.po
storage/xtradb/build/debian/po/pt.po
storage/xtradb/build/debian/po/pt_BR.po
storage/xtradb/build/debian/po/ro.po
storage/xtradb/build/debian/po/ru.po
storage/xtradb/build/debian/po/sv.po
storage/xtradb/build/debian/po/templates.pot
storage/xtradb/build/debian/po/tr.po
storage/xtradb/build/debian/rules
storage/xtradb/build/debian/source.lintian-overrides
storage/xtradb/build/debian/watch
storage/xtradb/build/percona-sql.spec
renamed:
mysql-test/r/innodb_bug39438.result => mysql-test/suite/innodb/r/innodb_bug39438.result
mysql-test/r/variables+c.result => mysql-test/r/variables_community.result
mysql-test/t/innodb-use-sys-malloc.test => mysql-test/suite/innodb/t/innodb-use-sys-malloc.test
mysql-test/t/innodb_bug39438-master.opt => mysql-test/suite/innodb/t/innodb_bug39438-master.opt
mysql-test/t/innodb_bug39438.test => mysql-test/suite/innodb/t/innodb_bug39438.test
mysql-test/t/variables+c.test => mysql-test/t/variables_community.test
modified:
.bzrignore
COPYING
INSTALL-SOURCE
INSTALL-WIN-SOURCE
client/mysql.cc
client/mysql_upgrade.c
client/mysqladmin.cc
client/mysqlbinlog.cc
client/mysqlcheck.c
client/mysqldump.c
client/mysqlimport.c
client/mysqlshow.c
client/mysqlslap.c
client/mysqltest.cc
cmd-line-utils/readline/rlmbutil.h
configure.in
extra/libevent/event-internal.h
extra/yassl/include/yassl_error.hpp
extra/yassl/src/ssl.cpp
extra/yassl/src/yassl_error.cpp
include/Makefile.am
include/my_global.h
include/my_sys.h
include/mysql/plugin.h
include/mysql/plugin.h.pp
libmysql/libmysql.c
man/comp_err.1
man/innochecksum.1
man/make_win_bin_dist.1
man/msql2mysql.1
man/my_print_defaults.1
man/myisam_ftdump.1
man/myisamchk.1
man/myisamlog.1
man/myisampack.1
man/mysql-stress-test.pl.1
man/mysql-test-run.pl.1
man/mysql.1
man/mysql.server.1
man/mysql_client_test.1
man/mysql_config.1
man/mysql_convert_table_format.1
man/mysql_find_rows.1
man/mysql_fix_extensions.1
man/mysql_fix_privilege_tables.1
man/mysql_install_db.1
man/mysql_secure_installation.1
man/mysql_setpermission.1
man/mysql_tzinfo_to_sql.1
man/mysql_upgrade.1
man/mysql_waitpid.1
man/mysql_zap.1
man/mysqlaccess.1
man/mysqladmin.1
man/mysqlbinlog.1
man/mysqlbug.1
man/mysqlcheck.1
man/mysqld.8
man/mysqld_multi.1
man/mysqld_safe.1
man/mysqldump.1
man/mysqldumpslow.1
man/mysqlhotcopy.1
man/mysqlimport.1
man/mysqlmanager.8
man/mysqlshow.1
man/mysqlslap.1
man/mysqltest.1
man/ndbd.8
man/ndbd_redo_log_reader.1
man/ndbmtd.8
man/perror.1
man/replace.1
man/resolve_stack_dump.1
man/resolveip.1
mysql-test/Makefile.am
mysql-test/collections/default.daily
mysql-test/collections/default.push
mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
mysql-test/extra/rpl_tests/rpl_loaddata.test
mysql-test/include/mtr_warnings.sql
mysql-test/include/test_fieldsize.inc
mysql-test/lib/My/ConfigFactory.pm
mysql-test/lib/My/SafeProcess.pm
mysql-test/lib/My/SafeProcess/safe_process_win.cc
mysql-test/lib/mtr_cases.pm
mysql-test/lib/mtr_gprof.pl
mysql-test/lib/mtr_misc.pl
mysql-test/lib/mtr_report.pm
mysql-test/lib/mtr_stress.pl
mysql-test/lib/v1/mtr_stress.pl
mysql-test/lib/v1/mysql-test-run.pl
mysql-test/mysql-stress-test.pl
mysql-test/mysql-test-run.pl
mysql-test/r/archive.result
mysql-test/r/backup.result
mysql-test/r/bigint.result
mysql-test/r/compare.result
mysql-test/r/csv.result
mysql-test/r/ctype_ldml.result
mysql-test/r/ctype_ucs.result
mysql-test/r/default.result
mysql-test/r/delete.result
mysql-test/r/error_simulation.result
mysql-test/r/explain.result
mysql-test/r/fulltext.result
mysql-test/r/func_concat.result
mysql-test/r/func_gconcat.result
mysql-test/r/func_str.result
mysql-test/r/func_time.result
mysql-test/r/gis-rtree.result
mysql-test/r/group_by.result
mysql-test/r/group_min_max.result
mysql-test/r/handler_myisam.result
mysql-test/r/having.result
mysql-test/r/index_merge_myisam.result
mysql-test/r/information_schema.result
mysql-test/r/information_schema_all_engines.result
mysql-test/r/innodb_mysql.result
mysql-test/r/join.result
mysql-test/r/join_outer.result
mysql-test/r/loaddata.result
mysql-test/r/log_state.result
mysql-test/r/merge.result
mysql-test/r/metadata.result
mysql-test/r/multi_update.result
mysql-test/r/myisam.result
mysql-test/r/mysqlbinlog.result
mysql-test/r/mysqlbinlog_row_innodb.result
mysql-test/r/mysqltest.result
mysql-test/r/partition.result
mysql-test/r/partition_error.result
mysql-test/r/partition_innodb.result
mysql-test/r/partition_pruning.result
mysql-test/r/partition_range.result
mysql-test/r/ps.result
mysql-test/r/query_cache_with_views.result
mysql-test/r/row.result
mysql-test/r/select.result
mysql-test/r/show_check.result
mysql-test/r/skip_name_resolve.result
mysql-test/r/sp-bugs.result
mysql-test/r/sp-error.result
mysql-test/r/sp.result
mysql-test/r/sp_notembedded.result
mysql-test/r/sp_trans.result
mysql-test/r/subselect.result
mysql-test/r/subselect3.result
mysql-test/r/symlink.result
mysql-test/r/table_elim.result
mysql-test/r/trigger.result
mysql-test/r/type_bit.result
mysql-test/r/type_blob.result
mysql-test/r/type_date.result
mysql-test/r/type_datetime.result
mysql-test/r/type_timestamp.result
mysql-test/r/type_year.result
mysql-test/r/union.result
mysql-test/r/update.result
mysql-test/r/variables.result
mysql-test/r/variables_debug.result
mysql-test/r/view.result
mysql-test/r/view_grant.result
mysql-test/r/warnings.result
mysql-test/r/xa.result
mysql-test/suite/binlog/r/binlog_innodb_row.result
mysql-test/suite/binlog/r/binlog_row_mix_innodb_myisam.result
mysql-test/suite/binlog/r/binlog_stm_binlog.result
mysql-test/suite/binlog/r/binlog_stm_mix_innodb_myisam.result
mysql-test/suite/binlog/r/binlog_stm_unsafe_warning.result
mysql-test/suite/binlog/r/binlog_tmp_table.result
mysql-test/suite/binlog/r/binlog_unsafe.result
mysql-test/suite/binlog/t/binlog_innodb_row.test
mysql-test/suite/binlog/t/binlog_killed.test
mysql-test/suite/binlog/t/binlog_stm_binlog.test
mysql-test/suite/binlog/t/binlog_stm_unsafe_warning.test
mysql-test/suite/binlog/t/binlog_tmp_table.test
mysql-test/suite/federated/federated.result
mysql-test/suite/federated/federated.test
mysql-test/suite/funcs_1/r/is_columns_is.result
mysql-test/suite/funcs_1/r/is_tables_is.result
mysql-test/suite/maria/t/maria-recovery-bitmap.test
mysql-test/suite/parts/inc/partition_auto_increment.inc
mysql-test/suite/parts/r/partition_auto_increment_archive.result
mysql-test/suite/parts/r/partition_auto_increment_blackhole.result
mysql-test/suite/parts/r/partition_auto_increment_innodb.result
mysql-test/suite/parts/r/partition_auto_increment_maria.result
mysql-test/suite/parts/r/partition_auto_increment_memory.result
mysql-test/suite/parts/r/partition_auto_increment_myisam.result
mysql-test/suite/parts/r/partition_auto_increment_ndb.result
mysql-test/suite/pbxt/r/default.result
mysql-test/suite/pbxt/r/func_str.result
mysql-test/suite/pbxt/r/group_min_max.result
mysql-test/suite/pbxt/r/join_nested.result
mysql-test/suite/pbxt/r/mysqlshow.result
mysql-test/suite/pbxt/r/negation_elimination.result
mysql-test/suite/pbxt/r/null.result
mysql-test/suite/pbxt/r/order_by.result
mysql-test/suite/pbxt/r/pbxt_ref_int.result
mysql-test/suite/pbxt/r/pbxt_xa.result
mysql-test/suite/pbxt/r/range.result
mysql-test/suite/pbxt/r/select.result
mysql-test/suite/pbxt/r/select_safe.result
mysql-test/suite/pbxt/r/subselect.result
mysql-test/suite/pbxt/r/type_timestamp.result
mysql-test/suite/pbxt/t/pbxt_xa.test
mysql-test/suite/pbxt/t/select_safe.test
mysql-test/suite/rpl/r/rpl_begin_commit_rollback.result
mysql-test/suite/rpl/r/rpl_do_grant.result
mysql-test/suite/rpl/r/rpl_events.result
mysql-test/suite/rpl/r/rpl_get_master_version_and_clock.result
mysql-test/suite/rpl/r/rpl_innodb_mixed_dml.result
mysql-test/suite/rpl/r/rpl_row_create_table.result
mysql-test/suite/rpl/r/rpl_sp.result
mysql-test/suite/rpl/t/disabled.def
mysql-test/suite/rpl/t/rpl_begin_commit_rollback.test
mysql-test/suite/rpl/t/rpl_do_grant.test
mysql-test/suite/rpl/t/rpl_events.test
mysql-test/suite/rpl/t/rpl_get_master_version_and_clock.test
mysql-test/suite/rpl/t/rpl_loaddata_symlink.test
mysql-test/suite/rpl/t/rpl_row_create_table.test
mysql-test/suite/rpl/t/rpl_slave_skip.test
mysql-test/suite/sys_vars/r/log_basic.result
mysql-test/suite/sys_vars/r/log_bin_trust_routine_creators_basic.result
mysql-test/suite/sys_vars/r/myisam_sort_buffer_size_basic_32.result
mysql-test/suite/sys_vars/r/myisam_sort_buffer_size_basic_64.result
mysql-test/suite/sys_vars/r/slow_query_log_func.result
mysql-test/suite/sys_vars/t/innodb_table_locks_func.test
mysql-test/suite/sys_vars/t/slow_query_log_func.test
mysql-test/suite/sys_vars/t/sql_low_priority_updates_func.test
mysql-test/t/archive.test
mysql-test/t/bigint.test
mysql-test/t/csv.test
mysql-test/t/ctype_ldml.test
mysql-test/t/ctype_ucs.test
mysql-test/t/delete.test
mysql-test/t/disabled.def
mysql-test/t/error_simulation.test
mysql-test/t/explain.test
mysql-test/t/fulltext.test
mysql-test/t/func_concat.test
mysql-test/t/func_gconcat.test
mysql-test/t/func_str.test
mysql-test/t/gis-rtree.test
mysql-test/t/group_by.test
mysql-test/t/group_min_max.test
mysql-test/t/handler_myisam.test
mysql-test/t/having.test
mysql-test/t/information_schema_all_engines.test
mysql-test/t/innodb_mysql.test
mysql-test/t/join.test
mysql-test/t/join_outer.test
mysql-test/t/loaddata.test
mysql-test/t/merge.test
mysql-test/t/metadata.test
mysql-test/t/multi_update.test
mysql-test/t/myisam.test
mysql-test/t/mysql_upgrade.test
mysql-test/t/mysqlbinlog.test
mysql-test/t/mysqltest.test
mysql-test/t/partition.test
mysql-test/t/partition_error.test
mysql-test/t/partition_innodb.test
mysql-test/t/partition_innodb_plugin.test
mysql-test/t/partition_innodb_semi_consistent.test
mysql-test/t/partition_pruning.test
mysql-test/t/partition_range.test
mysql-test/t/ps.test
mysql-test/t/query_cache_with_views.test
mysql-test/t/row.test
mysql-test/t/skip_name_resolve.test
mysql-test/t/sp-bugs.test
mysql-test/t/sp_notembedded.test
mysql-test/t/subselect.test
mysql-test/t/symlink.test
mysql-test/t/trigger.test
mysql-test/t/type_bit.test
mysql-test/t/type_date.test
mysql-test/t/type_year.test
mysql-test/t/udf.test
mysql-test/t/update.test
mysql-test/t/variables.test
mysql-test/t/variables_debug.test
mysql-test/t/view.test
mysql-test/t/view_grant.test
mysql-test/t/xa.test
mysys/charset.c
mysys/default.c
mysys/mf_loadpath.c
mysys/mf_pack.c
mysys/my_alloc.c
mysys/my_file.c
mysys/my_getwd.c
mysys/my_init.c
mysys/my_symlink.c
scripts/fill_help_tables.sql
scripts/make_binary_distribution.sh
scripts/make_win_bin_dist
scripts/mysql_system_tables_fix.sql
scripts/mysqld_safe.sh
scripts/mysqlhotcopy.sh
server-tools/instance-manager/options.cc
sql/CMakeLists.txt
sql/debug_sync.cc
sql/debug_sync.h
sql/events.cc
sql/field.cc
sql/field.h
sql/field_conv.cc
sql/filesort.cc
sql/ha_ndbcluster.cc
sql/ha_partition.cc
sql/handler.cc
sql/handler.h
sql/item.cc
sql/item.h
sql/item_cmpfunc.cc
sql/item_cmpfunc.h
sql/item_create.cc
sql/item_create.h
sql/item_func.cc
sql/item_row.cc
sql/item_row.h
sql/item_strfunc.cc
sql/item_strfunc.h
sql/item_subselect.cc
sql/item_subselect.h
sql/item_sum.cc
sql/item_sum.h
sql/item_timefunc.cc
sql/log.cc
sql/log_event.cc
sql/log_event.h
sql/log_event_old.cc
sql/mysql_priv.h
sql/mysqld.cc
sql/opt_range.cc
sql/opt_range.h
sql/opt_sum.cc
sql/protocol.cc
sql/rpl_utility.cc
sql/rpl_utility.h
sql/set_var.cc
sql/share/errmsg.txt
sql/slave.cc
sql/sp.cc
sql/sp_cache.cc
sql/sp_head.cc
sql/sp_head.h
sql/sql_acl.cc
sql/sql_base.cc
sql/sql_class.cc
sql/sql_class.h
sql/sql_delete.cc
sql/sql_insert.cc
sql/sql_lex.cc
sql/sql_lex.h
sql/sql_load.cc
sql/sql_parse.cc
sql/sql_partition.cc
sql/sql_plugin.cc
sql/sql_profile.cc
sql/sql_repl.cc
sql/sql_select.cc
sql/sql_select.h
sql/sql_show.cc
sql/sql_table.cc
sql/sql_trigger.cc
sql/sql_update.cc
sql/sql_view.cc
sql/sql_yacc.yy
sql/table.cc
sql/table.h
storage/archive/ha_archive.cc
storage/csv/ha_tina.cc
storage/example/ha_example.h
storage/federated/ha_federated.cc
storage/federated/ha_federated.h
storage/myisam/ft_boolean_search.c
storage/myisam/ha_myisam.cc
storage/myisam/mi_check.c
storage/myisam/mi_delete_all.c
storage/myisam/mi_delete_table.c
storage/myisam/mi_dynrec.c
storage/myisam/mi_extra.c
storage/myisam/mi_locking.c
storage/myisam/mi_open.c
storage/myisam/mi_page.c
storage/myisam/mi_rnext.c
storage/myisam/mi_write.c
storage/myisam/myisamdef.h
storage/myisam/rt_index.c
storage/myisam/rt_split.c
storage/myisam/sort.c
storage/myisammrg/ha_myisammrg.cc
storage/myisammrg/myrg_open.c
storage/pbxt/ChangeLog
storage/pbxt/Makefile.am
storage/pbxt/src/backup_xt.cc
storage/pbxt/src/cache_xt.cc
storage/pbxt/src/cache_xt.h
storage/pbxt/src/database_xt.cc
storage/pbxt/src/database_xt.h
storage/pbxt/src/datadic_xt.cc
storage/pbxt/src/datadic_xt.h
storage/pbxt/src/datalog_xt.cc
storage/pbxt/src/filesys_xt.h
storage/pbxt/src/ha_pbxt.cc
storage/pbxt/src/index_xt.cc
storage/pbxt/src/index_xt.h
storage/pbxt/src/lock_xt.cc
storage/pbxt/src/lock_xt.h
storage/pbxt/src/locklist_xt.cc
storage/pbxt/src/myxt_xt.cc
storage/pbxt/src/pthread_xt.cc
storage/pbxt/src/pthread_xt.h
storage/pbxt/src/restart_xt.cc
storage/pbxt/src/restart_xt.h
storage/pbxt/src/strutil_xt.cc
storage/pbxt/src/tabcache_xt.cc
storage/pbxt/src/tabcache_xt.h
storage/pbxt/src/table_xt.cc
storage/pbxt/src/table_xt.h
storage/pbxt/src/thread_xt.cc
storage/pbxt/src/thread_xt.h
storage/pbxt/src/trace_xt.cc
storage/pbxt/src/trace_xt.h
storage/pbxt/src/xaction_xt.cc
storage/pbxt/src/xaction_xt.h
storage/pbxt/src/xactlog_xt.cc
storage/pbxt/src/xactlog_xt.h
storage/pbxt/src/xt_config.h
storage/pbxt/src/xt_defs.h
storage/xtradb/btr/btr0btr.c
storage/xtradb/btr/btr0cur.c
storage/xtradb/btr/btr0pcur.c
storage/xtradb/btr/btr0sea.c
storage/xtradb/buf/buf0buddy.c
storage/xtradb/buf/buf0buf.c
storage/xtradb/buf/buf0flu.c
storage/xtradb/buf/buf0rea.c
storage/xtradb/dict/dict0crea.c
storage/xtradb/dict/dict0dict.c
storage/xtradb/dict/dict0mem.c
storage/xtradb/fil/fil0fil.c
storage/xtradb/fsp/fsp0fsp.c
storage/xtradb/handler/ha_innodb.cc
storage/xtradb/handler/ha_innodb.h
storage/xtradb/handler/i_s.cc
storage/xtradb/handler/i_s.h
storage/xtradb/handler/innodb_patch_info.h
storage/xtradb/include/btr0btr.ic
storage/xtradb/include/buf0buddy.h
storage/xtradb/include/buf0buf.h
storage/xtradb/include/buf0buf.ic
storage/xtradb/include/buf0types.h
storage/xtradb/include/dict0dict.h
storage/xtradb/include/dict0mem.h
storage/xtradb/include/fil0fil.h
storage/xtradb/include/fsp0types.h
storage/xtradb/include/fut0fut.ic
storage/xtradb/include/ha_prototypes.h
storage/xtradb/include/page0cur.h
storage/xtradb/include/page0types.h
storage/xtradb/include/srv0srv.h
storage/xtradb/include/trx0sys.h
storage/xtradb/include/univ.i
storage/xtradb/include/ut0rnd.h
storage/xtradb/include/ut0rnd.ic
storage/xtradb/lock/lock0lock.c
storage/xtradb/log/log0log.c
storage/xtradb/log/log0recv.c
storage/xtradb/mtr/mtr0log.c
storage/xtradb/page/page0cur.c
storage/xtradb/page/page0zip.c
storage/xtradb/row/row0ins.c
storage/xtradb/row/row0merge.c
storage/xtradb/row/row0sel.c
storage/xtradb/srv/srv0srv.c
storage/xtradb/srv/srv0start.c
storage/xtradb/trx/trx0i_s.c
storage/xtradb/trx/trx0trx.c
support-files/compiler_warnings.supp
support-files/mysql.spec.sh
The size of the diff (1809184 lines) is larger than your specified limit of 5000 lines
--
lp:~maria-captains/maria/5.1-converting
https://code.launchpad.net/~maria-captains/maria/5.1-converting
Re: [Maria-developers] [Commits] Rev 2861: fix questionable UNIV_EXPECT's in the xtradb that confused old gcc. in http://bazaar.launchpad.net/~maria-captains/maria/5.1/
by Michael Widenius 16 Jun '10
Hi!
>>>>> "serg" == serg <serg(a)askmonty.org> writes:
serg> At http://bazaar.launchpad.net/~maria-captains/maria/5.1/
serg> ------------------------------------------------------------
serg> revno: 2861
serg> revision-id: sergii(a)pisem.net-20100609115351-op2cui7bw14y76kp
serg> parent: knielsen(a)knielsen-hq.org-20100531084334-81f5z74nxx6v9zww
serg> committer: Sergei Golubchik <sergii(a)pisem.net>
serg> branch nick: 5.1
serg> timestamp: Wed 2010-06-09 13:53:51 +0200
serg> message:
serg> fix questionable UNIV_EXPECT's in the xtradb that confused old gcc.
I assume you have cc:ed the XtraDB developers about this change, so that
we don't have to do it over and over again?
Regards,
Monty
[Maria-developers] [Commits] Rev 2866: mysqltest: use setenv, not putenv, to make gcov happy. in http://bazaar.launchpad.net/~maria-captains/maria/5.1/
by Michael Widenius 16 Jun '10
Hi!
>>>>> "serg" == serg <serg(a)askmonty.org> writes:
serg> At http://bazaar.launchpad.net/~maria-captains/maria/5.1/
serg> ------------------------------------------------------------
serg> revno: 2866
serg> revision-id: sergii(a)pisem.net-20100614091854-5ynq6lo943qlaacw
serg> parent: monty(a)askmonty.org-20100613221332-ldsnptg0j0mn8u9a
serg> committer: Sergei Golubchik <sergii(a)pisem.net>
serg> branch nick: 5.1
serg> timestamp: Mon 2010-06-14 11:18:54 +0200
serg> message:
serg> mysqltest: use setenv, not putenv, to make gcov happy.
serg> (backport from MySQL)
+static int setenv(const char *name, const char *value, int overwrite)
+{
+ size_t buflen= strlen(name) + strlen(value) + 2;
+ char *envvar= (char *)malloc(buflen);
+ if(!envvar)
+ return ENOMEM;
+ strcpy(envvar, name);
+ strcat(envvar, "=");
+ strcat(envvar, value);
+ putenv(envvar);
+ return 0;
+}
+#endif
I expected better from you :)
A much better version is:
strcat(strcat(strmov(envvar, name), "="), value);
The other question I have is: won't this cause a memory leak?
If we allocate the same string many times in here, it will definitely
be a memory leak, as putenv() will never free the old value.
Regards,
Monty
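
A minimal sketch (not from the patch, and not MySQL code) of how the fallback
could avoid that leak: remember the strings handed to putenv() and free the
previous one for the same variable once its replacement is installed. The name
setenv_noleak, the fixed-size table, and the assumption that a replaced
putenv() entry is no longer referenced by the environment (glibc behaviour)
are illustrative choices, not part of the committed change.

#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Fallback setenv() built on putenv() that remembers the strings it has
   installed and frees the previous one for the same variable name, so
   repeated calls for the same name do not leak.  Assumes at most 16
   distinct names and that a replaced putenv() entry is dropped from
   environ (glibc behaviour). */
static int setenv_noleak(const char *name, const char *value, int overwrite)
{
  static char *owned[16];                 /* strings we passed to putenv() */
  size_t name_len= strlen(name);
  char *envvar= (char *) malloc(name_len + strlen(value) + 2);
  size_t i;
  (void) overwrite;                       /* this fallback always overwrites */
  if (!envvar)
    return ENOMEM;
  strcpy(envvar, name);
  strcat(envvar, "=");
  strcat(envvar, value);
  if (putenv(envvar) != 0)
  {
    free(envvar);
    return ENOMEM;
  }
  for (i= 0; i < sizeof(owned)/sizeof(owned[0]); i++)
  {
    /* A string we installed earlier for this name is now unreferenced. */
    if (owned[i] && strncmp(owned[i], name, name_len) == 0 &&
        owned[i][name_len] == '=')
    {
      free(owned[i]);
      owned[i]= envvar;
      return 0;
    }
  }
  for (i= 0; i < sizeof(owned)/sizeof(owned[0]); i++)
  {
    if (!owned[i])                        /* first time we see this name */
    {
      owned[i]= envvar;
      break;
    }
  }
  return 0;
}

Monty's one-pass suggestion, strcat(strcat(strmov(envvar, name), "="), value),
builds the same "name=value" string more compactly; plain strcpy()/strcat() is
used above only to keep the sketch self-contained outside the MySQL source tree.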
[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 11:56)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.9396 2010-06-13 11:56:34.000000000 +0000
+++ /tmp/wklog.121.new.9396 2010-06-13 11:56:34.000000000 +0000
@@ -1,17 +1,55 @@
-Basic idea: DS-MRR scan should be done as follows:
+1. Choices to be made
+---------------------
-1. Sort incoming keys
-2. Use the sorted keys to do a disk-ordered retrieval
+1.1 Handling of complex ranges
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The "sort incoming keys" part is easy when we have only equality ranges.
+If we allow ranges of arbitrary form (including ranges with one endpoint
+being infinity and/or ranges overlapping with one another), sorting becomes
+non-trivial. Do we need to support this case or support only equality ranges?
-Unresolved questions:
+Decision: the new code should handle only the case with equality ranges.
+For non-equality ranges, the execution will proceed as before.
-* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( including ranges with one endpoint
- being infinity or ranges overlapping with one another), sorting becomes
- non-trivial. Do we need to support this case or support only equality ranges?
+1.2 Handling index prefix scans
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+What do we do if asked to do a scan on a prefix of clustered PK?
-* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs? (current decision: No)
+Decision: handle this if the ranges are equality ranges. The difference from
+scan on full primary key is that
+- we will have to use read_range_XXX() or index_read()/index_next_same()
+ functions, while for full primary key value we could have used rnd_pos().
+- One equality range can produce multiple matching records.
-* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
- of clustered PK will not be in disk order. We need to run it with regular mode)
+1.3 Use of knowledge that primary_key==rowid
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+PKs?
+Decision: don't make this assumption.
+
+
+2. Code-level changes overview
+------------------------------
+
+DsMrr_impl::choose_mrr_impl():
+Enable MRR when
+ - ihis is a clustered primary key
+ - incoming ranges are single-point (HA_MRR_SINGLE_POINT is set)
+ - will need to make the SQL layer to set this flag
+ - incoming ranges are not already sorted (HA_MRR_SORTED is not set)
+
+(TODO do we need new cost formula?)
+
+DsMrr_impl::dsmrr_init()
+ - different elem_size (not rowid length but key tuple length)
+ - don't create the secondary handler object, we won't need it.
+
+DsMrr_impl::dsmrr_fill_buffer():
+ - need a variant of this function that would not access the index but just
+ fill and sort the array.
+
+DsMrr_impl::dsmrr_next():
+ - should abstract-out:
+ - buffer element size
+ - rnd_pos/index_read call.
+ - Also for CPK prefix scans there can be multi
-=-=(Psergey - Sun, 13 Jun 2010, 11:55)=-=-
High Level Description modified.
--- /tmp/wklog.121.old.9380 2010-06-13 11:55:42.000000000 +0000
+++ /tmp/wklog.121.new.9380 2010-06-13 11:55:42.000000000 +0000
@@ -1,18 +1,18 @@
-Currently, DS-MRR doesn't support operation over clustered primary keys. The
-reason for this was that
- - Clustered primary keys are stored in disk order and so, if the sequence of
- ranges is ordered, the reads will already go in disk order (and so DS-MRR's
- step of re-ordering reads is not necessary).
+Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
+reason for this is that
+ - Clustered primary keys are stored in disk order, so, if the sequence of
+ scanned ranges is ordered, the reads will automatically happen in disk
+ order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
-However, with BKA making the MRR calls, there are cases where DS-MRR over
+However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
- order them, so that it hits the disk in key order.
+ order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
@@ -20,3 +20,9 @@
* TODO anything else?
+This WL entry is about addressing the above by adding support of DS-MRR over
+clustered primary key that would work according to this strategy:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval.
+
+
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
reason for this is that
- Clustered primary keys are stored in disk order, so, if the sequence of
scanned ranges is ordered, the reads will automatically happen in disk
order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
This WL entry is about addressing the above by adding support of DS-MRR over
clustered primary key that would work according to this strategy:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval.
HIGH-LEVEL SPECIFICATION:
1. Choices to be made
---------------------
1.1 Handling of complex ranges
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The "sort incoming keys" part is easy when we have only equality ranges.
If we allow ranges of arbitrary form (including ranges with one endpoint
being infinity and/or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
Decision: the new code should handle only the case with equality ranges.
For non-equality ranges, the execution will proceed as before.
1.2 Handling index prefix scans
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What do we do if asked to do a scan on a prefix of clustered PK?
Decision: handle this if the ranges are equality ranges. The difference from
scan on full primary key is that
- we will have to use read_range_XXX() or index_read()/index_next_same()
functions, while for full primary key value we could have used rnd_pos().
- One equality range can produce multiple matching records.
1.3 Use of knowledge that primary_key==rowid
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs?
Decision: don't make this assumption.
2. Code-level changes overview
------------------------------
DsMrr_impl::choose_mrr_impl():
Enable MRR when
- this is a clustered primary key
- incoming ranges are single-point (HA_MRR_SINGLE_POINT is set)
- will need to make the SQL layer set this flag
- incoming ranges are not already sorted (HA_MRR_SORTED is not set)
(TODO do we need new cost formula?)
DsMrr_impl::dsmrr_init()
- different elem_size (not rowid length but key tuple length)
- don't create the secondary handler object, we won't need it.
DsMrr_impl::dsmrr_fill_buffer():
- need a variant of this function that would not access the index but just
fill and sort the array.
DsMrr_impl::dsmrr_next():
- should abstract-out:
- buffer element size
- rnd_pos/index_read call.
- Also for CPK prefix scans there can be multi
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
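
A minimal standalone sketch (not the server's DsMrr_impl code) of the strategy
described in the specification above: buffer the single-point keys, sort them,
then do the lookups in key (= disk) order. KeyTuple, key_less() and the
handler_lookup callback are illustrative stand-ins for the real MRR buffer
element format and for the index_read()/rnd_pos() calls mentioned in the
specification.

#include <algorithm>
#include <cstring>
#include <vector>

struct KeyTuple                           // packed clustered-PK value
{
  std::vector<unsigned char> bytes;
};

static bool key_less(const KeyTuple &a, const KeyTuple &b)
{
  size_t n= a.bytes.size() < b.bytes.size() ? a.bytes.size() : b.bytes.size();
  int cmp= n ? memcmp(a.bytes.data(), b.bytes.data(), n) : 0;
  return cmp ? cmp < 0 : a.bytes.size() < b.bytes.size();
}

/* "Fill" phase: collect the HA_MRR_SINGLE_POINT keys into a buffer and sort
   them (the proposed dsmrr_fill_buffer() variant that does not touch the
   index), then visit them in sorted order.  Because the index is the
   clustered primary key, key order equals disk order. */
void mrr_sorted_lookup(std::vector<KeyTuple> keys,
                       void (*handler_lookup)(const KeyTuple &key))
{
  std::sort(keys.begin(), keys.end(), key_less);
  for (size_t i= 0; i < keys.size(); i++)
  {
    if (i > 0 && !key_less(keys[i - 1], keys[i]))
      continue;                           // skip exact duplicates from BKA
    handler_lookup(keys[i]);              // stands in for index_read()/rnd_pos()
  }
}

For equality ranges on a prefix of the clustered key, one sorted key can match
several rows, which is why the specification proposes index_read()/
index_next_same() rather than a single rnd_pos() per key.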
[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 11:56)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.9396 2010-06-13 11:56:34.000000000 +0000
+++ /tmp/wklog.121.new.9396 2010-06-13 11:56:34.000000000 +0000
@@ -1,17 +1,55 @@
-Basic idea: DS-MRR scan should be done as follows:
+1. Choices to be made
+---------------------
-1. Sort incoming keys
-2. Use the sorted keys to do a disk-ordered retrieval
+1.1 Handling of complex ranges
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The "sort incoming keys" part is easy when we have only equality ranges.
+If we allow ranges of arbitrary form (including ranges with one endpoint
+being infinity and/or ranges overlapping with one another), sorting becomes
+non-trivial. Do we need to support this case or support only equality ranges?
-Unresolved questions:
+Decision: the new code should handle only the case with equality ranges.
+For non-equality ranges, the execution will proceed as before.
-* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( including ranges with one endpoint
- being infinity or ranges overlapping with one another), sorting becomes
- non-trivial. Do we need to support this case or support only equality ranges?
+1.2 Handling index prefix scans
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+What do we do if asked to do a scan on a prefix of clustered PK?
-* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs? (current decision: No)
+Decision: handle this if the ranges are equality ranges. The difference from
+scan on full primary key is that
+- we will have to use read_range_XXX() or index_read()/index_next_same()
+ functions, while for full primary key value we could have used rnd_pos().
+- One equality range can produce multiple matching records.
-* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
- of clustered PK will not be in disk order. We need to run it with regular mode)
+1.3 Use of knowledge that primary_key==rowid
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+PKs?
+Decision: don't make this assumption.
+
+
+2. Code-level changes overview
+------------------------------
+
+DsMrr_impl::choose_mrr_impl():
+Enable MRR when
+ - ihis is a clustered primary key
+ - incoming ranges are single-point (HA_MRR_SINGLE_POINT is set)
+ - will need to make the SQL layer to set this flag
+ - incoming ranges are not already sorted (HA_MRR_SORTED is not set)
+
+(TODO do we need new cost formula?)
+
+DsMrr_impl::dsmrr_init()
+ - different elem_size (not rowid length but key tuple length)
+ - don't create the secondary handler object, we won't need it.
+
+DsMrr_impl::dsmrr_fill_buffer():
+ - need a variant of this function that would not access the index but just
+ fill and sort the array.
+
+DsMrr_impl::dsmrr_next():
+ - should abstract-out:
+ - buffer element size
+ - rnd_pos/index_read call.
+ - Also for CPK prefix scans there can be multi
-=-=(Psergey - Sun, 13 Jun 2010, 11:55)=-=-
High Level Description modified.
--- /tmp/wklog.121.old.9380 2010-06-13 11:55:42.000000000 +0000
+++ /tmp/wklog.121.new.9380 2010-06-13 11:55:42.000000000 +0000
@@ -1,18 +1,18 @@
-Currently, DS-MRR doesn't support operation over clustered primary keys. The
-reason for this was that
- - Clustered primary keys are stored in disk order and so, if the sequence of
- ranges is ordered, the reads will already go in disk order (and so DS-MRR's
- step of re-ordering reads is not necessary).
+Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
+reason for this is that
+ - Clustered primary keys are stored in disk order, so, if the sequence of
+ scanned ranges is ordered, the reads will automatically happen in disk
+ order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
-However, with BKA making the MRR calls, there are cases where DS-MRR over
+However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
- order them, so that it hits the disk in key order.
+ order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
@@ -20,3 +20,9 @@
* TODO anything else?
+This WL entry is about addressing the above by adding support of DS-MRR over
+clustered primary key that would work according to this strategy:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval.
+
+
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
reason for this is that
- Clustered primary keys are stored in disk order, so, if the sequence of
scanned ranges is ordered, the reads will automatically happen in disk
order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
This WL entry is about addressing the above by adding support of DS-MRR over
clustered primary key that would work according to this strategy:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval.
HIGH-LEVEL SPECIFICATION:
1. Choices to be made
---------------------
1.1 Handling of complex ranges
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The "sort incoming keys" part is easy when we have only equality ranges.
If we allow ranges of arbitrary form (including ranges with one endpoint
being infinity and/or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
Decision: the new code should handle only the case with equality ranges.
For non-equality ranges, the execution will proceed as before.
1.2 Handling index prefix scans
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What do we do if asked to do a scan on a prefix of clustered PK?
Decision: handle this if the ranges are equality ranges. The difference from
scan on full primary key is that
- we will have to use read_range_XXX() or index_read()/index_next_same()
functions, while for full primary key value we could have used rnd_pos().
- One equality range can produce multiple matching records.
1.3 Use of knowledge that primary_key==rowid
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs?
Decision: don't make this assumption.
2. Code-level changes overview
------------------------------
DsMrr_impl::choose_mrr_impl():
Enable MRR when
- this is a clustered primary key
- incoming ranges are single-point (HA_MRR_SINGLE_POINT is set)
- will need to make the SQL layer set this flag
- incoming ranges are not already sorted (HA_MRR_SORTED is not set)
(TODO do we need new cost formula?)
DsMrr_impl::dsmrr_init()
- different elem_size (not rowid length but key tuple length)
- don't create the secondary handler object, we won't need it.
DsMrr_impl::dsmrr_fill_buffer():
- need a variant of this function that would not access the index but just
fill and sort the array.
DsMrr_impl::dsmrr_next():
- should abstract-out:
- buffer element size
- rnd_pos/index_read call.
- Also for CPK prefix scans there can be multi
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 11:55)=-=-
High Level Description modified.
--- /tmp/wklog.121.old.9380 2010-06-13 11:55:42.000000000 +0000
+++ /tmp/wklog.121.new.9380 2010-06-13 11:55:42.000000000 +0000
@@ -1,18 +1,18 @@
-Currently, DS-MRR doesn't support operation over clustered primary keys. The
-reason for this was that
- - Clustered primary keys are stored in disk order and so, if the sequence of
- ranges is ordered, the reads will already go in disk order (and so DS-MRR's
- step of re-ordering reads is not necessary).
+Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
+reason for this is that
+ - Clustered primary keys are stored in disk order, so, if the sequence of
+ scanned ranges is ordered, the reads will automatically happen in disk
+ order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
-However, with BKA making the MRR calls, there are cases where DS-MRR over
+However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
- order them, so that it hits the disk in key order.
+ order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
@@ -20,3 +20,9 @@
* TODO anything else?
+This WL entry is about addressing the above by adding support of DS-MRR over
+clustered primary key that would work according to this strategy:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval.
+
+
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
reason for this is that
- Clustered primary keys are stored in disk order, so, if the sequence of
scanned ranges is ordered, the reads will automatically happen in disk
order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
This WL entry is about addressing the above by adding support of DS-MRR over
clustered primary key that would work according to this strategy:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval.
HIGH-LEVEL SPECIFICATION:
Basic idea: DS-MRR scan should be done as follows:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs? (current decision: No)
* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
of clustered PK will not be in disk order. We need to run it with regular mode)
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 11:55)=-=-
High Level Description modified.
--- /tmp/wklog.121.old.9380 2010-06-13 11:55:42.000000000 +0000
+++ /tmp/wklog.121.new.9380 2010-06-13 11:55:42.000000000 +0000
@@ -1,18 +1,18 @@
-Currently, DS-MRR doesn't support operation over clustered primary keys. The
-reason for this was that
- - Clustered primary keys are stored in disk order and so, if the sequence of
- ranges is ordered, the reads will already go in disk order (and so DS-MRR's
- step of re-ordering reads is not necessary).
+Currently, DS-MRR doesn't allow to do MRR scans over clustered primary keys. The
+reason for this is that
+ - Clustered primary keys are stored in disk order, so, if the sequence of
+ scanned ranges is ordered, the reads will automatically happen in disk
+ order, and DS-MRR's step of re-ordering reads is redundant.
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
-However, with BKA making the MRR calls, there are cases where DS-MRR over
+However, when MRR calls are made by BKA, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
- order them, so that it hits the disk in key order.
+ order them, so that it hits the disk in key(=disk) order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
@@ -20,3 +20,9 @@
* TODO anything else?
+This WL entry is about addressing the above by adding support of DS-MRR over
+clustered primary key that would work according to this strategy:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval.
+
+
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR does not allow MRR scans over clustered primary keys. The
reasons for this are:
- Clustered primary keys are stored in disk order, so if the sequence of
scanned ranges is ordered, the reads already happen in disk order and
DS-MRR's step of re-ordering reads is redundant.
- Within the DS-MRR implementation, the "get rowids from keys" step is not
necessary when using a clustered primary key, because in InnoDB/XtraDB the
clustered primary key values are the rowids.
However, when the MRR calls are made by BKA, there are cases where DS-MRR over
a clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or arrive in arbitrary
order. In that case, the DS-MRR implementation can sort the key values so
that it hits the disk in key(=disk) order.
* When running a multi-table join with a high @@join_cache_level value (and
so, linked join buffers), the lack of an MRR implementation breaks the chain
of linked join buffers. (TODO and so what? Is that really a problem?)
* TODO anything else?
This WL entry is about addressing the above by adding support for DS-MRR over
clustered primary keys, working according to this strategy (a sketch follows
below):
1. Sort the incoming keys.
2. Use the sorted keys to do a disk-ordered retrieval.
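[Editorial note, not part of the worklog entry] The following is a minimal,
self-contained C++ sketch of the sort-then-retrieve strategy described above.
It is not the MariaDB DS-MRR code: the PkKey type, dsmrr_scan_clustered_pk()
and the fetch_row callback are hypothetical simplifications used only to make
the two steps concrete.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

using PkKey = std::string;  // packed clustered-PK value (byte-comparable)
using FetchRowFn = std::function<void(const PkKey &)>;

// Buffer the lookup keys supplied by the caller (e.g. BKA), sort them so
// reads happen in key(=disk) order, drop duplicates, then fetch each row.
void dsmrr_scan_clustered_pk(std::vector<PkKey> keys, const FetchRowFn &fetch_row)
{
    // 1. Sort the incoming keys: for a clustered PK, key order == disk order.
    std::sort(keys.begin(), keys.end());

    // BKA may supply duplicate keys; after sorting they are adjacent, so
    // each distinct key is read only once.
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

    // 2. Use the sorted keys to do a disk-ordered retrieval.
    for (const PkKey &key : keys)
        fetch_row(key);
}

int main()
{
    // Keys as BKA might hand them over: unordered, with duplicates.
    std::vector<PkKey> keys = {"k42", "k07", "k42", "k13"};
    dsmrr_scan_clustered_pk(keys, [](const PkKey &k) {
        std::printf("fetch row with PK %s\n", k.c_str());  // stand-in for the engine read
    });
    return 0;
}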
HIGH-LEVEL SPECIFICATION:
Basic idea: a DS-MRR scan should be done as follows:
1. Sort the incoming keys.
2. Use the sorted keys to do a disk-ordered retrieval.
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form (including ranges with one endpoint at
infinity, or ranges overlapping one another), sorting becomes non-trivial
(see the sketch after this list). Do we need to support this case, or do we
support only equality ranges?
* Can/should we use the fact that rowid == {clustered PK value} for InnoDB's
clustered PKs? (current decision: No)
* Do we support scanning on a prefix of the clustered PK? (Yes, but a scan on
a prefix of the clustered PK will not be in disk order; we need to run it in
regular mode.)
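[Editorial note, not part of the worklog entry] To illustrate why the first
question matters: with equality ranges, "sorting the keys" is an ordinary
sort, but with open-ended or overlapping ranges one also has to merge them
before a single disk-ordered sweep makes sense. A hedged C++ sketch follows;
the integer keys, the KeyRange struct and merge_ranges() are illustrative
assumptions, not taken from the worklog or the MariaDB sources.

#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

// One key range; the extreme values of long long stand in for
// "-infinity" / "+infinity". An equality range has min_key == max_key.
struct KeyRange {
    long long min_key;
    long long max_key;
};

// Collapse an arbitrary set of ranges into disjoint ranges in ascending key
// order, so that a single pass over them touches the disk in order.
std::vector<KeyRange> merge_ranges(std::vector<KeyRange> ranges)
{
    std::sort(ranges.begin(), ranges.end(),
              [](const KeyRange &a, const KeyRange &b) { return a.min_key < b.min_key; });
    std::vector<KeyRange> merged;
    for (const KeyRange &r : ranges) {
        if (!merged.empty() && r.min_key <= merged.back().max_key)
            merged.back().max_key = std::max(merged.back().max_key, r.max_key);  // overlap: extend
        else
            merged.push_back(r);                                                 // disjoint: start new
    }
    return merged;
}

int main()
{
    const long long INF = std::numeric_limits<long long>::max();
    // Two overlapping ranges, one open-ended range, one equality "range".
    for (const KeyRange &r : merge_ranges({{10, 20}, {15, 25}, {40, INF}, {7, 7}}))
        std::printf("[%lld, %lld]\n", r.min_key, r.max_key);
    return 0;
}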
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 13 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sun, 13 Jun 2010, 09:42)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.5009 2010-06-13 09:42:38.000000000 +0000
+++ /tmp/wklog.121.new.5009 2010-06-13 09:42:38.000000000 +0000
@@ -6,11 +6,12 @@
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
- If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
- non-trival. Do we need to support this case or support only equality ranges?
+ non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
- PKs?
+ PKs? (current decision: No)
-* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
+* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
+ of clustered PK will not be in disk order. We need to run it with regular mode)
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't support operation over clustered primary keys. The
reason for this was that
- Clustered primary keys are stored in disk order and so, if the sequence of
ranges is ordered, the reads will already go in disk order (and so DS-MRR's
step of re-ordering reads is not necessary).
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, with BKA making the MRR calls, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
HIGH-LEVEL SPECIFICATION:
Basic idea: DS-MRR scan should be done as follows:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form ( including ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
non-trivial. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs? (current decision: No)
* Do we support scanning on a prefix of clustered PK? (Yes but scan on prefix
of clustered PK will not be in disk order. We need to run it with regular mode)
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] Updated (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 12 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
-=-=(Psergey - Sat, 12 Jun 2010, 08:39)=-=-
High-Level Specification modified.
--- /tmp/wklog.121.old.538 2010-06-12 08:39:46.000000000 +0000
+++ /tmp/wklog.121.new.538 2010-06-12 08:39:46.000000000 +0000
@@ -1 +1,16 @@
+Basic idea: DS-MRR scan should be done as follows:
+1. Sort incoming keys
+2. Use the sorted keys to do a disk-ordered retrieval
+
+Unresolved questions:
+
+* The "sort incoming keys" part is trivial when we have only equality ranges.
+ If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
+ being infinity or ranges overlapping with one another), sorting becomes
+ non-trival. Do we need to support this case or support only equality ranges?
+
+* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
+ PKs?
+
+* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
DESCRIPTION:
Currently, DS-MRR doesn't support operation over clustered primary keys. The
reason for this was that
- Clustered primary keys are stored in disk order and so, if the sequence of
ranges is ordered, the reads will already go in disk order (and so DS-MRR's
step of re-ordering reads is not necessary).
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, with BKA making the MRR calls, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
HIGH-LEVEL SPECIFICATION:
Basic idea: DS-MRR scan should be done as follows:
1. Sort incoming keys
2. Use the sorted keys to do a disk-ordered retrieval
Unresolved questions:
* The "sort incoming keys" part is trivial when we have only equality ranges.
If we allow ranges of arbitrary form ( ncluding ranges with one endpoint
being infinity or ranges overlapping with one another), sorting becomes
non-trival. Do we need to support this case or support only equality ranges?
* Can/should we use the fact rowid=={clustered PK value} for InnoDB's clustered
PKs?
* Do we support scanning on a prefix of clustered PK? (seems to be yes?)
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
[Maria-developers] New (by Psergey): Add DS-MRR support for clustered primary keys (121)
by worklog-noreply@askmonty.org 12 Jun '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add DS-MRR support for clustered primary keys
CREATION DATE..: Sat, 12 Jun 2010, 08:23
SUPERVISOR.....: Igor
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Client-BackLog
TASK ID........: 121 (http://askmonty.org/worklog/?tid=121)
VERSION........: Benchmarks-3.0
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0
PROGRESS NOTES:
DESCRIPTION:
Currently, DS-MRR doesn't support operation over clustered primary keys. The
reason for this was that
- Clustered primary keys are stored in disk order and so, if the sequence of
ranges is ordered, the reads will already go in disk order (and so DS-MRR's
step of re-ordering reads is not necessary).
- Within DS-MRR implementation, the "get rowids from keys" step is not
necessary when using clustered primary key, because in InnoDB/XtraDB
clustered primary key values are the rowids.
However, with BKA making the MRR calls, there are cases where DS-MRR over
clustered primary key is beneficial:
* BKA may provide lookup keys that have duplicates and/or are in arbitrary
order. In that case, DS-MRR implementation may sort the key values and
order them, so that it hits the disk in key order.
* When running multi-table join with high @@join_cache_level value (and so,
linked join buffers), lack of MRR implementation causes the chain of linked
join buffers to break. (TODO and so what? Is that really a problem?)
* TODO anything else?
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
Hello,
We have designed a custom storage engine (CLDB) for MySQL/MariaDB
which now passes all of our preliminary tests.
I wonder how we can solve the licensing problem.
We would not share our sources; we would only ship a shared library
(*.so).
Which license would a customer who buys MariaDB with CLDB need in
order to stay legal?
If it is possible, can such a customer use MariaDB under the GPL while
we sell CLDB (the storage engine) under another license (and if so, which one)?
Thank you for your quick reply.
--
___________________________________________________________
Mateusz Matan
IT Security R&D Department, C/C++ programmer
ComArch S.A., Al. Jana Pawła II 41d, 31-864 Kraków
tel: (+48 12) 684 8411
e-mail: Mateusz.Matan(a)comarch.pl