developers
Threads by month
- ----- 2025 -----
- February
- January
- ----- 2024 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2023 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2022 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2021 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2020 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2019 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2018 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2017 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2016 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2015 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2014 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2013 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2012 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2011 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2010 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- January
- ----- 2009 -----
- December
- November
- October
- September
- August
- July
- June
- May
- April
- March
- February
- 6826 discussions
[Maria-developers] New (by Igor): Partitioned Key Cache for MyISAM (85)
by worklog-noreply@askmonty.org 13 Feb '10
by worklog-noreply@askmonty.org 13 Feb '10
13 Feb '10
-----------------------------------------------------------------------
WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Partitioned Key Cache for MyISAM
CREATION DATE..: Sun, 14 Feb 2010, 00:10
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......: Igor, Monty, Sergei
CATEGORY.......: Server-BackLog
TASK ID........: 85 (http://askmonty.org/worklog/?tid=85)
VERSION........: Benchmarks-3.0
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 80 (hours remain)
ORIG. ESTIMATE.: 80
PROGRESS NOTES:
DESCRIPTION:
A partitioned key cache is a collection of structures for regular MyiSAM key
caches called key cache partitions. Any page from a file can be placed into a
buffer of only one partition. The number of the partition is calculated from the
file number and the position of the page in the file, and it's always the same
for the page. The function that maps pages into partitions takes care of even
distribution of pages among partitions.
Partition key cache mitigate one of the major problem of simple key cache:
thread contention for key cache lock (mutex). Every call of a key cache
interface function must acquire this lock. So threads compete for this lock even
in the case when they have acquired shared locks for the file and pages they
want read from are in the key cache buffers. When working with a partitioned key
cache any key cache interface function that needs only one page has to acquire
the key cache lock only for the partition the page is ascribed to. This makes
the chances for threads not compete for the same key cache lock better.
The idea and the original of the partitioned key cache was provided by one of
our external contributers.
ESTIMATED WORK TIME
ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)
1
0
[Maria-developers] Rev 2757: Fix for previous cset in file:///home/psergey/dev/maria-5.3-subqueries-r6/
by Sergey Petrunya 12 Feb '10
by Sergey Petrunya 12 Feb '10
12 Feb '10
At file:///home/psergey/dev/maria-5.3-subqueries-r6/
------------------------------------------------------------
revno: 2757
revision-id: psergey(a)askmonty.org-20100212181041-5rwekm1wpvwaikkx
parent: psergey(a)askmonty.org-20100211235958-p11o4e80dlrn2bsq
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: maria-5.3-subqueries-r6
timestamp: Fri 2010-02-12 21:10:41 +0300
message:
Fix for previous cset
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-02-11 23:59:58 +0000
+++ b/sql/item_subselect.cc 2010-02-12 18:10:41 +0000
@@ -1316,6 +1316,7 @@
(char *)in_left_expr_name);
master_unit->uncacheable|= UNCACHEABLE_DEPENDENT;
+ select_lex->uncacheable|= UNCACHEABLE_DEPENDENT;
}
if (!abort_on_null && left_expr->maybe_null && !pushed_cond_guards)
1
0
[Maria-developers] Rev 3770: MWL#68 Subquery optimization: Efficient NOT IN execution with NULLs in file:///home/tsk/mprog/src/mysql-6.0-mwl68/
by timour@askmonty.org 12 Feb '10
by timour@askmonty.org 12 Feb '10
12 Feb '10
At file:///home/tsk/mprog/src/mysql-6.0-mwl68/
------------------------------------------------------------
revno: 3770
revision-id: timour(a)askmonty.org-20100212143343-l0pjascssuqedfk6
parent: timour(a)askmonty.org-20100201120948-mdt7gtwcz50q1dzp
committer: timour(a)askmonty.org
branch nick: mysql-6.0-mwl68
timestamp: Fri 2010-02-12 16:33:43 +0200
message:
MWL#68 Subquery optimization: Efficient NOT IN execution with NULLs
This patch implements working partial matching for materialized subqueries.
The code passes the full regression test, except differences in EXPLAIN.
There are no other known test failures.
Diff too large for email (1593 lines, the limit is 1000).
1
0
[Maria-developers] bzr commit into file:///home/tsk/mprog/src/mysql-6.0-mwl68/ branch (timour:3770)
by timour@askmonty.org 12 Feb '10
by timour@askmonty.org 12 Feb '10
12 Feb '10
#At file:///home/tsk/mprog/src/mysql-6.0-mwl68/ based on revid:timour@askmonty.org-20100201120948-mdt7gtwcz50q1dzp
3770 timour(a)askmonty.org 2010-02-12
MWL#68 Subquery optimization: Efficient NOT IN execution with NULLs
This patch implements working partial matching for materialized subqueries.
The code passes the full regression test, except differences in EXPLAIN.
There are no other known test failures.
modified:
sql/item_subselect.cc
sql/item_subselect.h
sql/sql_class.cc
sql/sql_class.h
sql/sql_select.cc
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-02-01 12:09:48 +0000
+++ b/sql/item_subselect.cc 2010-02-12 14:33:43 +0000
@@ -2436,6 +2436,17 @@ int subselect_uniquesubquery_engine::sca
for (;;)
{
error=table->file->rnd_next(table->record[0]);
+ /*
+ TODO: The below tests are wrong, Monty's proposal:
+ if (error) {
+ if (error == HA_ERR_RECORD_DELETED)
+ continue;
+ if (error = HA_ERR_END_OF_FILE)
+ break;
+ else
+ report error;
+ break;
+ */
if (error && error != HA_ERR_END_OF_FILE)
{
error= report_error(table, error);
@@ -2453,6 +2464,11 @@ int subselect_uniquesubquery_engine::sca
}
table->file->ha_rnd_end();
+ /*
+ TODO: it seems to be an error to return TRUE when the error was
+ HA_ERR_END_OF_FILE which is perfectly fine. HA_ERR_END_OF_FILE
+ only means we didn't find a match.
+ */
DBUG_RETURN(error != 0);
}
@@ -2517,6 +2533,10 @@ bool subselect_uniquesubquery_engine::co
See also the comment for the subselect_uniquesubquery_engine::exec()
function.
*/
+ /*
+ TODO: If not all outer cols are NULL, how we know the result is NULL,
+ and not FALSE? Even on top-level.
+ */
null_keypart= (*copy)->null_key;
if (null_keypart)
{
@@ -2556,6 +2576,59 @@ bool subselect_uniquesubquery_engine::co
/*
+ @retval 1 A NULL was found in the outer reference, index lookup is
+ not applicable, the outer ref is unsusable as a lookup key,
+ use some other method to find a match.
+ @retval 0 The outer ref was copied into an index lookup key.
+ @retval -1 The outer ref cannot possibly match any row, IN is FALSE.
+*/
+
+int subselect_uniquesubquery_engine::copy_ref_key_simple()
+{
+ for (store_key **copy= tab->ref.key_copy ; *copy ; copy++)
+ {
+ enum store_key::store_key_result store_res;
+ store_res= (*copy)->copy();
+ tab->ref.key_err= store_res;
+
+ /*
+ When there is a NULL part in the key we don't need to make index
+ lookup for such key thus we don't need to copy whole key.
+ If we later should do a sequential scan return OK. Fail otherwise.
+
+ See also the comment for the subselect_uniquesubquery_engine::exec()
+ function.
+ */
+ /*
+ TODO: If not all outer cols are NULL, how we know the result is NULL,
+ and not FALSE? Even on top-level.
+ */
+ null_keypart= (*copy)->null_key;
+ if (null_keypart)
+ return 1;
+
+ /*
+ Check if the error is equal to STORE_KEY_FATAL. This is not expressed
+ using the store_key::store_key_result enum because ref.key_err is a
+ boolean and we want to detect both TRUE and STORE_KEY_FATAL from the
+ space of the union of the values of [TRUE, FALSE] and
+ store_key::store_key_result.
+ TODO: fix the variable an return types.
+ */
+ if (store_res == store_key::STORE_KEY_FATAL)
+ {
+ /*
+ Error converting the left IN operand to the column type of the right
+ IN operand.
+ */
+ return -1;
+ }
+ }
+ return 0;
+}
+
+
+/*
Execute subselect
SYNOPSIS
@@ -2595,7 +2668,10 @@ int subselect_uniquesubquery_engine::exe
/* TODO: change to use of 'full_scan' here? */
if (copy_ref_key())
+ {
+ /* TODO: copy_ref_key() == 1 means NULL result, not error, why return 1? */
DBUG_RETURN(1);
+ }
if (table->status)
{
/*
@@ -2637,6 +2713,52 @@ int subselect_uniquesubquery_engine::exe
/*
+ TODO: this needs more thinking, as exec() is a bit wrong IMO.
+ - we don't need empty_result_set, as it is == 1 <=> when
+ item->value == 0
+ - scan_table() returns >0 even when there was no actuall error,
+ but we only found EOF while scanning.
+ - scan_table should not check table->status, but it should check
+ HA_ERR_END_OF_FILE
+*/
+
+int subselect_uniquesubquery_engine::index_lookup()
+{
+ DBUG_ENTER("subselect_uniquesubquery_engine::index_lookup");
+ int error;
+ TABLE *table= tab->table;
+ empty_result_set= TRUE;
+ table->status= 0;
+
+ if (!table->file->inited)
+ table->file->ha_index_init(tab->ref.key, 0);
+ error= table->file->index_read_map(table->record[0],
+ tab->ref.key_buff,
+ make_prev_keypart_map(tab->ref.key_parts),
+ HA_READ_KEY_EXACT);
+ DBUG_PRINT("info", ("lookup result: %i", error));
+ if (error &&
+ error != HA_ERR_KEY_NOT_FOUND && error != HA_ERR_END_OF_FILE)
+ error= report_error(table, error);
+ else
+ {
+ error= 0;
+ table->null_row= 0;
+ if (!table->status && (!cond || cond->val_int()))
+ {
+ ((Item_in_subselect *) item)->value= 1;
+ empty_result_set= FALSE;
+ }
+ else
+ ((Item_in_subselect *) item)->value= 0;
+ }
+
+ DBUG_RETURN(error);
+}
+
+
+
+/*
Index-lookup subselect 'engine' - run the subquery
SYNOPSIS
@@ -3136,6 +3258,7 @@ void subselect_hash_sj_engine::set_strat
Item_in_subselect *item_in= (Item_in_subselect *) item;
select_materialize_with_stats *result_sink=
(select_materialize_with_stats *) result;
+ Item *outer_col;
DBUG_ENTER("subselect_hash_sj_engine::set_strategy_using_data");
@@ -3146,13 +3269,20 @@ void subselect_hash_sj_engine::set_strat
{
if (!bitmap_is_set(&partial_match_key_parts, i))
continue;
-
- if (result_sink->get_null_count_of_col(i) == 0)
+ outer_col= item_in->left_expr->element_index(i);
+ /*
+ If column 'i' doesn't contain NULLs, and the corresponding outer reference
+ cannot have a NULL value, then 'i' is a non-nullable column.
+ */
+ if (result_sink->get_null_count_of_col(i) == 0 && !outer_col->maybe_null)
{
bitmap_clear_bit(&partial_match_key_parts, i);
bitmap_set_bit(&non_null_key_parts, i);
--count_partial_match_columns;
}
+ if (result_sink->get_null_count_of_col(i) ==
+ tmp_table->file->stats.records)
+ ++count_null_only_columns;
}
/* If no column contains NULLs use regular hash index lookups. */
@@ -3177,6 +3307,7 @@ bitmap_init_memroot(MY_BITMAP *map, uint
bitmap_buffer_size(n_bits))) ||
bitmap_init(map, bitmap_buf, n_bits, FALSE))
return TRUE;
+ bitmap_clear_all(map);
return FALSE;
}
@@ -3209,10 +3340,10 @@ bool subselect_hash_sj_engine::init_perm
DBUG_ENTER("subselect_hash_sj_engine::init_permanent");
- if (!(bitmap_init_memroot(&non_null_key_parts, tmp_columns->elements,
- thd->mem_root)) ||
- !(bitmap_init_memroot(&partial_match_key_parts, tmp_columns->elements,
- thd->mem_root)))
+ if (bitmap_init_memroot(&non_null_key_parts, tmp_columns->elements,
+ thd->mem_root) ||
+ bitmap_init_memroot(&partial_match_key_parts, tmp_columns->elements,
+ thd->mem_root))
DBUG_RETURN(TRUE);
set_strategy_using_schema();
@@ -3548,33 +3679,45 @@ int subselect_hash_sj_engine::exec()
if (strategy == PARTIAL_MATCH)
{
- subselect_rowid_merge_engine *new_lookup_engine;
+ subselect_rowid_merge_engine *rowid_merge_engine;
uint count_pm_keys;
MY_BITMAP *nn_key_parts;
+ bool has_covering_null_row;
+ select_materialize_with_stats *result_sink=
+ (select_materialize_with_stats *) result;
+
/* Total number of keys needed for partial matching. */
- if (count_partial_match_columns < tmp_table->s->fields)
- {
- count_pm_keys= count_partial_match_columns + 1;
- nn_key_parts= &non_null_key_parts;
- }
+ nn_key_parts= (count_partial_match_columns < tmp_table->s->fields) ?
+ &non_null_key_parts : NULL;
+
+ has_covering_null_row= (result_sink->get_max_nulls_in_row() ==
+ tmp_table->s->fields -
+ (nn_key_parts ? bitmap_bits_set(nn_key_parts) : 0));
+
+ if (has_covering_null_row)
+ count_pm_keys= nn_key_parts ? 1 : 0;
else
- {
- count_pm_keys= count_partial_match_columns;
- nn_key_parts= NULL;
- }
+ count_pm_keys= count_partial_match_columns - count_null_only_columns +
+ (nn_key_parts ? 1 : 0);
- if (!(new_lookup_engine=
- new subselect_rowid_merge_engine(lookup_engine,
+ if (!(rowid_merge_engine=
+ new subselect_rowid_merge_engine((subselect_uniquesubquery_engine*)
+ lookup_engine,
tmp_table,
count_pm_keys,
+ has_covering_null_row,
item, result)) ||
- new_lookup_engine->init(nn_key_parts, &partial_match_key_parts))
+ rowid_merge_engine->init(nn_key_parts, &partial_match_key_parts))
{
- delete new_lookup_engine;
strategy= PARTIAL_MATCH_SCAN;
+ delete rowid_merge_engine;
/* TODO: setup execution structures for partial match via scanning. */
}
- strategy= PARTIAL_MATCH_INDEX;
+ else
+ {
+ strategy= PARTIAL_MATCH_INDEX;
+ lookup_engine= rowid_merge_engine;
+ }
}
item_in->change_engine(lookup_engine);
@@ -3632,15 +3775,49 @@ Ordered_key::Ordered_key(uint key_idx_ar
ha_rows min_null_row_arg, ha_rows max_null_row_arg,
uchar *row_num_to_rowid_arg)
: key_idx(key_idx_arg), tbl(tbl_arg), search_key(search_key_arg),
- row_num_to_rowid(row_num_to_rowid_arg), null_count(null_count_arg),
- min_null_row(min_null_row_arg), max_null_row(max_null_row_arg)
+ row_num_to_rowid(row_num_to_rowid_arg), null_count(null_count_arg)
+{
+ DBUG_ASSERT(tbl->file->stats.records > null_count);
+ key_buff_elements= tbl->file->stats.records - null_count;
+ cur_key_idx= HA_POS_ERROR;
+
+ DBUG_ASSERT((null_count && min_null_row_arg && max_null_row_arg) ||
+ (!null_count && !min_null_row_arg && !max_null_row_arg));
+ if (null_count)
+ {
+ /* The counters are 1-based, for key access we need 0-based indexes. */
+ min_null_row= min_null_row_arg - 1;
+ max_null_row= max_null_row_arg - 1;
+ }
+ else
+ min_null_row= max_null_row= 0;
+}
+
+
+Ordered_key::~Ordered_key()
{
- key_column_count= search_key->cols();
- cur_row= HA_POS_ERROR;
+ /*
+ All data structures are allocated on thd->mem_root, thus we don't
+ free them here.
+ */
}
/*
+ Cleanup that needs to be done for each PS (re)execution.
+*/
+
+void Ordered_key::cleanup()
+{
+ /*
+ Currently these keys are recreated for each PS re-execution, thus
+ there is nothing to cleanup, the whole object goes away after execution
+ is over. All handler related initialization/deinitialization is done by
+ the parent subselect_rowid_merge_engine object.
+ */
+}
+
+/*
Initialize a multi-column index.
*/
@@ -3648,10 +3825,10 @@ bool Ordered_key::init(MY_BITMAP *column
{
THD *thd= tbl->in_use;
uint cur_key_col= 0;
+ Item_field *cur_tmp_field;
+ Item_func_lt *fn_less_than;
- DBUG_ENTER("Ordered_key::init");
-
- DBUG_ASSERT(key_column_count == bitmap_bits_set(columns_to_index));
+ key_column_count= bitmap_bits_set(columns_to_index);
// TODO: check for mem allocation err, revert to scan
@@ -3660,22 +3837,26 @@ bool Ordered_key::init(MY_BITMAP *column
compare_pred= (Item_func_lt**) thd->alloc(key_column_count *
sizeof(Item_func_lt*));
- for (uint i= 0; i < columns_to_index->n_bits; i++, cur_key_col++)
+ for (uint i= 0; i < columns_to_index->n_bits; i++)
{
if (!bitmap_is_set(columns_to_index, i))
continue;
- key_columns[cur_key_col]= new Item_field(tbl->field[i]);
+ cur_tmp_field= new Item_field(tbl->field[i]);
/* Create the predicate (tmp_column[i] < outer_ref[i]). */
- compare_pred[cur_key_col]= new Item_func_lt(key_columns[cur_key_col],
- search_key->element_index(i));
+ fn_less_than= new Item_func_lt(cur_tmp_field,
+ search_key->element_index(i));
+ fn_less_than->fix_fields(thd, (Item**) &fn_less_than);
+ key_columns[cur_key_col]= cur_tmp_field;
+ compare_pred[cur_key_col]= fn_less_than;
+ ++cur_key_col;
}
if (alloc_keys_buffers())
{
/* TODO revert to partial match via table scan. */
- DBUG_RETURN(TRUE);
+ return TRUE;
}
- DBUG_RETURN(FALSE);
+ return FALSE;
}
@@ -3687,9 +3868,7 @@ bool Ordered_key::init(int col_idx)
{
THD *thd= tbl->in_use;
- DBUG_ENTER("Ordered_key::init");
-
- DBUG_ASSERT(key_column_count == 1);
+ key_column_count= 1;
// TODO: check for mem allocation err, revert to scan
@@ -3700,23 +3879,25 @@ bool Ordered_key::init(int col_idx)
/* Create the predicate (tmp_column[i] < outer_ref[i]). */
compare_pred[0]= new Item_func_lt(key_columns[0],
search_key->element_index(col_idx));
+ compare_pred[0]->fix_fields(thd, (Item**)&compare_pred[0]);
if (alloc_keys_buffers())
{
/* TODO revert to partial match via table scan. */
- DBUG_RETURN(TRUE);
+ return TRUE;
}
- DBUG_RETURN(FALSE);
+ return FALSE;
}
bool Ordered_key::alloc_keys_buffers()
{
THD *thd= tbl->in_use;
- ha_rows row_count= tbl->file->stats.records;
- if (!(row_index= (ha_rows*) thd->alloc((row_count - null_count) *
- sizeof(ha_rows))))
+ DBUG_ASSERT(key_buff_elements > 0);
+
+ if (!(key_buff= (rownum_t*) thd->alloc(key_buff_elements *
+ sizeof(rownum_t))))
return TRUE;
/*
@@ -3724,10 +3905,14 @@ bool Ordered_key::alloc_keys_buffers()
(max_null_row - min_null_row), and then use min_null_row as
lookup offset.
*/
- if (!(bitmap_init_memroot(&null_key, max_null_row,
- thd->mem_root)))
+ if (bitmap_init_memroot(&null_key,
+ /* this is max array index, we need count, so +1. */
+ max_null_row + 1,
+ thd->mem_root))
return TRUE;
+ cur_key_idx= HA_POS_ERROR;
+
return FALSE;
}
@@ -3735,66 +3920,88 @@ bool Ordered_key::alloc_keys_buffers()
/*
Quick sort comparison function that compares two rows of the same table
indentfied with their row numbers.
+
+ @retval -1
+ @retval 0
+ @retval +1
*/
-int Ordered_key::cmp_rows_by_rownum(Ordered_key *key, ha_rows *a, ha_rows *b)
+int
+Ordered_key::cmp_keys_by_row_data(ha_rows a, ha_rows b)
{
uchar *rowid_a, *rowid_b;
int error, cmp_res;
- TABLE *tbl= key->tbl;
/* The length in bytes of the rowids (positions) of tmp_table. */
uint rowid_length= tbl->file->ref_length;
- DBUG_ENTER("Ordered_key::cmp_rows_by_rownum");
if (a == b)
- DBUG_RETURN(0);
+ return 0;
/* Get the corresponding rowids. */
- rowid_a= key->row_num_to_rowid + (*a) * rowid_length;
- rowid_b= key->row_num_to_rowid + (*b) * rowid_length;
+ rowid_a= row_num_to_rowid + a * rowid_length;
+ rowid_b= row_num_to_rowid + b * rowid_length;
/* Fetch the rows for comparison. */
error= tbl->file->rnd_pos(tbl->record[0], rowid_a);
DBUG_ASSERT(!error);
error= tbl->file->rnd_pos(tbl->record[1], rowid_b);
DBUG_ASSERT(!error);
- /* Compare the two rows. */
- for (Field **f_ptr= tbl->field; *f_ptr; f_ptr++)
+ /*
+ Compare the two rows by the corresponding values of the indexed
+ columns.
+ */
+ for (uint i= 0; i < key_column_count; i++)
{
- if ((cmp_res= (*f_ptr)->cmp_offset(tbl->s->rec_buff_length)))
- DBUG_RETURN(cmp_res);
+ Field *cur_field= key_columns[i]->field;
+ if ((cmp_res= cur_field->cmp_offset(tbl->s->rec_buff_length)))
+ return (cmp_res > 0 ? 1 : -1);
}
- DBUG_RETURN(0);
+ return 0;
+}
+
+
+int
+Ordered_key::cmp_keys_by_row_data_and_rownum(Ordered_key *key,
+ rownum_t* a, rownum_t* b)
+{
+ /* The result of comparing the two keys according to their row data. */
+ int cmp_row_res= key->cmp_keys_by_row_data(*a, *b);
+ if (cmp_row_res)
+ return cmp_row_res;
+ return (*a < *b) ? -1 : (*a > *b) ? 1 : 0;
}
void Ordered_key::sort_keys()
{
- my_qsort(row_index, tbl->file->stats.records, sizeof(ha_rows),
- (qsort_cmp) &cmp_rows_by_rownum);
+ my_qsort2(key_buff, key_buff_elements, sizeof(rownum_t),
+ (qsort2_cmp) &cmp_keys_by_row_data_and_rownum, (void*) this);
+ /* Invalidate the current row position. */
+ cur_key_idx= HA_POS_ERROR;
}
/*
Compare the value(s) of the current key in 'search_key' with the
- data of the current table record accessible via 'key_columns'.
+ data of the current table record.
@notes The comparison result follows from the way compare_pred
is created in Ordered_key::init. Currently compare_pred compares
a field in of the current row with the corresponding Item that
contains the search key.
+ @param row_num Number of the row (not index in the key_buff array)
+
@retval -1 if (current row < search_key)
@retval 0 if (current row == search_key)
@retval +1 if (current row > search_key)
*/
-int Ordered_key::compare_row_with_key(ha_rows row_num)
+int Ordered_key::cmp_key_with_search_key(rownum_t row_num)
{
/* The length in bytes of the rowids (positions) of tmp_table. */
uint rowid_length= tbl->file->ref_length;
uchar *cur_rowid= row_num_to_rowid + row_num * rowid_length;
int error, cmp_res;
- DBUG_ENTER("Ordered_key::compare");
error= tbl->file->rnd_pos(tbl->record[0], cur_rowid);
DBUG_ASSERT(!error);
@@ -3804,9 +4011,9 @@ int Ordered_key::compare_row_with_key(ha
/* Unlike Arg_comparator::compare_row() here there should be no NULLs. */
DBUG_ASSERT(!compare_pred[i]->null_value);
if (cmp_res)
- DBUG_RETURN(cmp_res);
+ return (cmp_res > 0 ? 1 : -1);
}
- DBUG_RETURN(0);
+ return 0;
}
@@ -3818,17 +4025,24 @@ int Ordered_key::compare_row_with_key(ha
bool Ordered_key::lookup()
{
- DBUG_ENTER("Ordered_key::lookup");
+ DBUG_ASSERT(key_buff_elements);
ha_rows lo= 0;
- ha_rows hi= tbl->file->stats.records - 1;
+ ha_rows hi= key_buff_elements - 1;
ha_rows mid;
int cmp_res;
while (lo <= hi)
{
mid= lo + (hi - lo) / 2;
- cmp_res= compare_row_with_key(mid);
+ cmp_res= cmp_key_with_search_key(key_buff[mid]);
+ /*
+ In order to find the minimum match, check if the pevious element is
+ equal or smaller than the found one. If equal, we need to search further
+ to the left.
+ */
+ if (!cmp_res && mid > 0)
+ cmp_res= !cmp_key_with_search_key(key_buff[mid - 1]) ? 1 : 0;
if (cmp_res == -1)
{
@@ -3838,17 +4052,48 @@ bool Ordered_key::lookup()
else if (cmp_res == 1)
{
/* row[mid] > search_key */
+ if (!mid)
+ goto not_found;
hi= mid - 1;
}
else
{
/* row[mid] == search_key */
- cur_row= mid;
- DBUG_RETURN(TRUE);
+ cur_key_idx= mid;
+ return TRUE;
}
}
+not_found:
+ cur_key_idx= HA_POS_ERROR;
+ return FALSE;
+}
- DBUG_RETURN(FALSE);
+
+/*
+ Move the current index pointer to the next key with the same column
+ values as the current key. Since the index is sorted, all such keys
+ are contiguous.
+*/
+
+bool Ordered_key::next_same()
+{
+ DBUG_ASSERT(key_buff_elements);
+
+ if (cur_key_idx < key_buff_elements - 1)
+ {
+ /*
+ TODO:
+ The below is quite inefficient, since as a result we will fetch every
+ row (except the last one) twice. There must be a more efficient way,
+ e.g. swapping record[0] and record[1], and reading only the new record.
+ */
+ if (!cmp_keys_by_row_data(key_buff[cur_key_idx], key_buff[cur_key_idx + 1]))
+ {
+ ++cur_key_idx;
+ return TRUE;
+ }
+ }
+ return FALSE;
}
@@ -3865,56 +4110,147 @@ subselect_rowid_merge_engine::init(MY_BI
/* The length in bytes of the rowids (positions) of tmp_table. */
uint rowid_length= tmp_table->file->ref_length;
ha_rows row_count= tmp_table->file->stats.records;
+ rownum_t cur_rownum= 0;
select_materialize_with_stats *result_sink=
(select_materialize_with_stats *) result;
uint cur_key= 0;
+ Item_in_subselect *item_in= (Item_in_subselect*) item;
+ int error;
- if (!(row_num_to_rowid= (uchar*) thd->alloc(row_count * rowid_length *
- sizeof(uchar))))
- return TRUE;
+ if (keys_count == 0)
+ {
+ /* There is nothing to initialize, we will only do regular lookups. */
+ return FALSE;
+ }
- if (!(bitmap_init_memroot(&matching_keys, keys_count, thd->mem_root)))
+ DBUG_ASSERT(!has_covering_null_row || (has_covering_null_row &&
+ keys_count == 1 &&
+ non_null_key_parts));
+
+ if (!(merge_keys= (Ordered_key**) thd->alloc(keys_count *
+ sizeof(Ordered_key*))) ||
+ !(row_num_to_rowid= (uchar*) thd->alloc(row_count * rowid_length *
+ sizeof(uchar))))
return TRUE;
- merge_keys= (Ordered_key**) thd->alloc(keys_count * sizeof(Ordered_key*));
/* Create the only non-NULL key if there is any. */
if (non_null_key_parts)
{
- non_null_key= new Ordered_key(cur_key, tmp_table, item, 0, 0, 0,
- row_num_to_rowid);
+ non_null_key= new Ordered_key(cur_key, tmp_table, item_in->left_expr,
+ 0, 0, 0, row_num_to_rowid);
if (non_null_key->init(non_null_key_parts))
{
// TODO: revert to partial matching via scanning
return TRUE;
}
merge_keys[cur_key]= non_null_key;
- non_null_key->sort_keys();
+ merge_keys[cur_key]->first();
++cur_key;
}
+
/*
- Create one single-column NULL-key for each column in
- partial_match_key_parts.
+ If there is a covering NULL row, the only key that is needed is the
+ only non-NULL key that is already created above.
*/
- for (uint i= 0; i < partial_match_key_parts->n_bits; i++, cur_key++)
+ if (!has_covering_null_row)
+ {
+ if (bitmap_init_memroot(&matching_keys, keys_count, thd->mem_root) ||
+ bitmap_init_memroot(&matching_outer_cols, keys_count, thd->mem_root) ||
+ bitmap_init_memroot(&null_only_columns, keys_count, thd->mem_root))
+ return TRUE;
+
+ /*
+ Create one single-column NULL-key for each column in
+ partial_match_key_parts.
+ */
+ for (uint i= 0; i < partial_match_key_parts->n_bits; i++)
+ {
+ if (!bitmap_is_set(partial_match_key_parts, i))
+ continue;
+
+ if (result_sink->get_null_count_of_col(i) == row_count)
+ bitmap_set_bit(&null_only_columns, cur_key);
+ else
+ {
+ merge_keys[cur_key]= new Ordered_key(cur_key, tmp_table,
+ item_in->left_expr->element_index(i),
+ result_sink->get_null_count_of_col(i),
+ result_sink->get_min_null_of_col(i),
+ result_sink->get_max_null_of_col(i),
+ row_num_to_rowid);
+ if (merge_keys[cur_key]->init(i))
+ {
+ // TODO: revert to partial matching via scanning
+ return TRUE;
+ }
+ merge_keys[cur_key]->first();
+ }
+ ++cur_key;
+ }
+ }
+
+ /* Populate the indexes with data from the temporary table. */
+ tmp_table->file->ha_rnd_init(1);
+ tmp_table->file->extra_opt(HA_EXTRA_CACHE,
+ current_thd->variables.read_buff_size);
+ tmp_table->null_row= 0;
+ while (TRUE)
{
- if (!bitmap_is_set(partial_match_key_parts, i))
+ error= tmp_table->file->rnd_next(tmp_table->record[0]);
+ if (error == HA_ERR_RECORD_DELETED)
+ {
+ /* We get this for duplicate records that should not be in tmp_table. */
continue;
+ }
+ /*
+ This is a temp table that we fully own, there should be no other
+ cause to stop the iteration than EOF.
+ */
+ DBUG_ASSERT(!error || error == HA_ERR_END_OF_FILE);
+ if (error == HA_ERR_END_OF_FILE)
+ {
+ DBUG_ASSERT(cur_rownum == tmp_table->file->stats.records);
+ break;
+ }
- merge_keys[cur_key]= new Ordered_key(cur_key, tmp_table, item,
- result_sink->get_null_count_of_col(i),
- result_sink->get_min_null_of_col(i),
- result_sink->get_max_null_of_col(i),
- row_num_to_rowid);
- if (merge_keys[cur_key]->init(i))
+ /*
+ Save the position of this record in the row_num -> rowid mapping.
+ */
+ tmp_table->file->position(tmp_table->record[0]);
+ memcpy(row_num_to_rowid + cur_rownum * rowid_length,
+ tmp_table->file->ref, rowid_length);
+
+ /* Add the current row number to the corresponding keys. */
+ if (non_null_key)
{
- // TODO: revert to partial matching via scanning
- return TRUE;
+ /* By definition there are no NULLs in the non-NULL key. */
+ non_null_key->add_key(cur_rownum);
}
- merge_keys[cur_key]->sort_keys();
+
+ for (uint i= (non_null_key ? 1 : 0); i < keys_count; i++)
+ {
+ /*
+ Check if the first and only indexed column contains NULL in the curent
+ row, and add the row number to the corresponding key.
+ */
+ if (tmp_table->field[merge_keys[i]->get_field_idx(0)]->is_null())
+ merge_keys[i]->set_null(cur_rownum);
+ else
+ merge_keys[i]->add_key(cur_rownum);
+ }
+ ++cur_rownum;
}
+ tmp_table->file->ha_rnd_end();
+
+ /* Sort the keys in each of the indexes. */
+ for (uint i= 0; i < keys_count; i++)
+ merge_keys[i]->sort_keys();
+
+ // TODO: sort all the keys by NULL selectivity
+
if (init_queue(&pq, keys_count, 0, FALSE,
- subselect_rowid_merge_engine::cmp_key_by_cur_row, NULL))
+ subselect_rowid_merge_engine::cmp_keys_by_cur_rownum, NULL))
{
// TODO: revert to partial matching via scanning
return TRUE;
@@ -3924,9 +4260,19 @@ subselect_rowid_merge_engine::init(MY_BI
}
+subselect_rowid_merge_engine::~subselect_rowid_merge_engine()
+{
+ delete_queue(&pq);
+}
+
+
void subselect_rowid_merge_engine::cleanup()
{
- // TODO
+ lookup_engine->cleanup();
+ /* Tell handler we don't need the index anymore */
+ if (tmp_table->file->inited)
+ tmp_table->file->ha_rnd_end();
+ queue_remove_all(&pq);
}
@@ -3934,8 +4280,8 @@ void subselect_rowid_merge_engine::clean
*/
int
-subselect_rowid_merge_engine::cmp_key_by_null_selectivity(Ordered_key *a,
- Ordered_key *b)
+subselect_rowid_merge_engine::cmp_keys_by_null_selectivity(Ordered_key *a,
+ Ordered_key *b)
{
double a_sel= a->null_selectivity();
double b_sel= b->null_selectivity();
@@ -3951,37 +4297,26 @@ subselect_rowid_merge_engine::cmp_key_by
*/
int
-subselect_rowid_merge_engine::cmp_key_by_cur_row(void *arg,
- uchar *k1, uchar *k2)
+subselect_rowid_merge_engine::cmp_keys_by_cur_rownum(void *arg,
+ uchar *k1, uchar *k2)
{
- ha_rows row1= ((Ordered_key*) k1)->current();
- ha_rows row2= ((Ordered_key*) k2)->current();
+ rownum_t r1= ((Ordered_key*) k1)->current();
+ rownum_t r2= ((Ordered_key*) k2)->current();
- if (row1 > row2)
- return 1;
- if (row1 == row2)
- return 0;
- return -1;
+ return (r1 < r2) ? -1 : (r1 > r2) ? 1 : 0;
}
/*
- Check if certain table row contains a NULL in all columns in all columns for
- which there is no value match.
-
- @details Notice that if a column is not in the set 'keys', we assume that has
- been checked otherwise that there is a partial or complete match for this
- column. This allows to encode columns that consist of only NULLs as simply
- missing in the set 'keys', because such columns match any value in any row.
+ Check if certain table row contains a NULL in all columns for which there is
+ no match in the corresponding value index.
@retval TRUE if a NULL row exists
@retval FALSE otherwise
*/
-bool subselect_rowid_merge_engine::test_null_row(ha_rows row_num)
+bool subselect_rowid_merge_engine::test_null_row(rownum_t row_num)
{
- DBUG_ENTER("subselect_rowid_merge_engine::test_null_row");
-
for (uint i = 0; i < keys_count; i++)
{
if (bitmap_is_set(&matching_keys, i))
@@ -3993,9 +4328,9 @@ bool subselect_rowid_merge_engine::test_
continue;
}
if (!merge_keys[i]->is_null(row_num))
- DBUG_RETURN(FALSE);
+ return FALSE;
}
- DBUG_RETURN(TRUE);
+ return TRUE;
}
@@ -4007,88 +4342,120 @@ bool subselect_rowid_merge_engine::test_
bool subselect_rowid_merge_engine::partial_match()
{
Ordered_key *min_key; /* Key that contains the current minimum position. */
- ha_rows min_row; /* Current row number of min_key. */
+ rownum_t min_row_num; /* Current row number of min_key. */
Ordered_key *cur_key;
- ha_rows cur_row;
-
- DBUG_ENTER("subselect_rowid_merge_engine::partial_match");
+ rownum_t cur_row_num;
+ uint count_nulls_in_search_key= 0;
/* If there is a non-NULL key, it must be the first key in the keys array. */
- DBUG_ASSERT(non_null_key && merge_keys[0] == non_null_key);
+ DBUG_ASSERT(!non_null_key || (non_null_key && merge_keys[0] == non_null_key));
/* Check if there is a match for the columns of the only non-NULL key. */
if (non_null_key && !non_null_key->lookup())
- DBUG_RETURN(FALSE);
+ return FALSE;
+
+ /*
+ If there is a NULL (sub)row that covers all NULL-able columns,
+ then there is a guranteed partial match, and we don't need to search
+ for the matching row.
+ */
+ if (has_covering_null_row)
+ return TRUE;
+
if (non_null_key)
queue_insert(&pq, (uchar *) non_null_key);
-
/*
- Add all non-empty value keys to the priority queue. Do not process the
- non_null_key, since it was already processed above.
+ Do not add the non_null_key, since it was already processed above.
*/
- uint i= non_null_key ? 1 : 0; /* Skip the non-NULL key, already processed. */
- for (; i < keys_count; i++)
+ bitmap_clear_all(&matching_outer_cols);
+ for (uint i= test(non_null_key); i < keys_count; i++)
{
- if (merge_keys[i]->lookup())
+ DBUG_ASSERT(merge_keys[i]->get_column_count() == 1);
+ if (merge_keys[i]->get_search_key(0)->is_null())
+ {
+ ++count_nulls_in_search_key;
+ bitmap_set_bit(&matching_outer_cols, merge_keys[i]->get_key_idx());
+ }
+ else if (merge_keys[i]->lookup())
queue_insert(&pq, (uchar *) merge_keys[i]);
}
+
/*
- Not all value keys are empty, thus we don't have only NULL keys. If we had,
- the only possible match is a NULL row, and we cheked there is no such row,
- therefore the result is known to be FALSE. In fact this algorithm makes
- sense for at least two non-NULL columns.
+ If the outer reference consists of only NULLs, or if it has NULLs in all
+ nullable columns, the result is UNKNOWN.
*/
- DBUG_ASSERT(pq.elements > 1);
+ if (count_nulls_in_search_key ==
+ ((Item_in_subselect *) item)->left_expr->cols() -
+ (non_null_key ? non_null_key->get_column_count() : 0))
+ return TRUE;
+
+ /*
+ If there is no NULL (sub)row that covers all NULL columns, and there is no
+ single match for any of the NULL columns, the result is FALSE.
+ */
+ if (pq.elements - test(non_null_key) == 0)
+ return FALSE;
+
+ DBUG_ASSERT(pq.elements);
+
min_key= (Ordered_key*) queue_remove(&pq, 0);
- min_row= min_key->current();
- bitmap_clear_all(&matching_keys);
+ min_row_num= min_key->current();
+ bitmap_copy(&matching_keys, &null_only_columns);
bitmap_set_bit(&matching_keys, min_key->get_key_idx());
- min_key->next();
- if (!min_key->is_eof())
+ bitmap_union(&matching_keys, &matching_outer_cols);
+ if (min_key->next_same())
queue_insert(&pq, (uchar *) min_key);
+ if (pq.elements == 0)
+ {
+ /*
+ Check the only matching row of the only key min_key for NULL matches
+ in the other columns.
+ */
+ if (test_null_row(min_row_num))
+ return TRUE;
+ else
+ return FALSE;
+ }
+
while (TRUE)
{
cur_key= (Ordered_key*) queue_remove(&pq, 0);
- cur_row= min_key->current();
+ cur_row_num= cur_key->current();
- if (cur_row == min_row)
- {
+ if (cur_row_num == min_row_num)
bitmap_set_bit(&matching_keys, cur_key->get_key_idx());
- /* There cannot be a complete match, as we already checked for one. */
- DBUG_ASSERT(bitmap_bits_set(&matching_keys) < matching_keys.n_bits);
- }
else
{
/* Follows from the correct use of priority queue. */
- DBUG_ASSERT(cur_row > min_row);
- if (test_null_row(min_row))
- DBUG_RETURN(TRUE);
+ DBUG_ASSERT(cur_row_num > min_row_num);
+ if (test_null_row(min_row_num))
+ return TRUE;
else
{
min_key= cur_key;
- min_row= cur_row;
- bitmap_clear_all(&matching_keys);
+ min_row_num= cur_row_num;
+ bitmap_copy(&matching_keys, &null_only_columns);
bitmap_set_bit(&matching_keys, min_key->get_key_idx());
+ bitmap_union(&matching_keys, &matching_outer_cols);
}
}
- cur_key->next();
- if (!cur_key->is_eof())
+ if (cur_key->next_same())
queue_insert(&pq, (uchar *) cur_key);
if (pq.elements == 0)
{
/* Check the last row of the last column in PQ for NULL matches. */
- if (test_null_row(min_row))
- DBUG_RETURN(TRUE);
+ if (test_null_row(min_row_num))
+ return TRUE;
else
- DBUG_RETURN(FALSE);
+ return FALSE;
}
}
/* We should never get here. */
DBUG_ASSERT(FALSE);
- DBUG_RETURN(FALSE);
+ return FALSE;
}
@@ -4097,22 +4464,54 @@ int subselect_rowid_merge_engine::exec()
Item_in_subselect *item_in= (Item_in_subselect *) item;
int res;
- DBUG_ENTER("subselect_rowid_merge_engine::exec");
-
- if ((res= lookup_engine->exec()))
+ /* Try to find a matching row by index lookup. */
+ res= lookup_engine->copy_ref_key_simple();
+ if (res == -1)
+ {
+ /* The result is FALSE based on the outer reference. */
+ item_in->value= 0;
+ item_in->null_value= 0;
+ return 0;
+ }
+ else if (res == 0)
{
- /* An error occured during exec(). */
- DBUG_RETURN(res);
+ if ((res= lookup_engine->index_lookup()))
+ {
+ /* An error occured during lookup(). */
+ item_in->value= 0;
+ item_in->null_value= 0;
+ return res;
+ }
+ else if (item_in->value)
+ {
+ /*
+ A complete match was found, the result of IN is TRUE.
+ Notice: (this->item == lookup_engine->item)
+ */
+ return 0;
+ }
}
- else if (item_in->value == 1)
+
+ if (has_covering_null_row && !keys_count)
{
/*
- A complete match was found, the result of IN is TRUE.
- Notice: (this->item == lookup_engine->item)
+ If there is a NULL-only row that coveres all columns the result of IN
+ is UNKNOWN.
*/
- DBUG_RETURN(0);
+ item_in->value= 0;
+ /*
+ TODO: which one is the right way to propagate an UNKNOWN result?
+ Should we also set empty_result_set= FALSE; ???
+ */
+ //item_in->was_null= 1;
+ item_in->null_value= 1;
+ return 0;
}
+ /* All data accesses during execution are via handler::rnd_pos() */
+ if (tmp_table->file->inited)
+ tmp_table->file->ha_index_end();
+ tmp_table->file->ha_rnd_init(0);
/*
There is no complete match. Look for a partial match (UNKNOWN result), or
no match (FALSE).
@@ -4121,18 +4520,25 @@ int subselect_rowid_merge_engine::exec()
{
/* The result of IN is UNKNOWN. */
item_in->value= 0;
- /* TODO: which one is the right way to propagate an UNKNOWN result? */
- item_in->was_null= 1;
+ /*
+ TODO: which one is the right way to propagate an UNKNOWN result?
+ Should we also set empty_result_set= FALSE; ???
+ */
+ //item_in->was_null= 1;
item_in->null_value= 1;
}
else
{
/* The result of IN is FALSE. */
item_in->value= 0;
- /* TODO: which one is the right way to propagate an UNKNOWN result? */
- item_in->was_null= 0;
+ /*
+ TODO: which one is the right way to propagate an UNKNOWN result?
+ Should we also set empty_result_set= FALSE; ???
+ */
+ //item_in->was_null= 0;
item_in->null_value= 0;
}
+ tmp_table->file->ha_rnd_end();
- DBUG_RETURN(0);
+ return 0;
}
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-02-01 12:09:48 +0000
+++ b/sql/item_subselect.h 2010-02-12 14:33:43 +0000
@@ -610,8 +610,10 @@ public:
virtual void print (String *str, enum_query_type query_type);
bool change_result(Item_subselect *si, select_result_interceptor *result);
bool no_tables();
+ int index_lookup();
int scan_table();
bool copy_ref_key();
+ int copy_ref_key_simple();
bool no_rows() { return empty_result_set; }
virtual enum_engine_type engine_type() { return UNIQUESUBQUERY_ENGINE; }
};
@@ -678,6 +680,34 @@ inline bool Item_subselect::is_uncacheab
return engine->uncacheable();
}
+/*
+ Distinguish the type od (0-based) row numbers from the type of the index into
+ an array of row numbers.
+*/
+typedef ha_rows rownum_t;
+
+
+/*
+ An Ordered_key is an in-memory table index that allows O(log(N)) time
+ lookups of a multi-part key.
+
+ If the index is over a single column, then this column may contain NULLs, and
+ the NULLs are stored and tested separately for NULL in O(1) via is_null().
+ Multi-part indexes assume that the indexed columns do not contain NULLs.
+
+ TODO:
+ = Due to the unnatural assymetry between single and multi-part indexes, it
+ makes sense to somehow refactor or extend the class.
+
+ = This class can be refactored into a base abstract interface, and two
+ subclasses:
+ - one to represent single-column indexes, and
+ - another to represent multi-column indexes.
+ Such separation would allow slightly more efficient implementation of
+ the single-column indexes.
+ = The current design requires such indexes to be fully recreated for each
+ PS (re)execution, however most of the comprising objects can be reused.
+*/
class Ordered_key
{
@@ -701,11 +731,12 @@ protected:
/* Value index related members. */
/*
The actual value index, consists of a sorted sequence of row numbers.
- There are tbl->file->stats.records elements in this array.
*/
- ha_rows *row_index;
- /* Current element in 'row_index'. */
- ha_rows cur_row;
+ rownum_t *key_buff;
+ /* Number of elements in key_buff. */
+ ha_rows key_buff_elements;
+ /* Current element in 'key_buff'. */
+ ha_rows cur_key_idx;
/*
Mapping from row numbers to row ids. The element row_num_to_rowid[i]
contains a buffer with the rowid for the row numbered 'i'.
@@ -734,15 +765,21 @@ protected:
Quick sort comparison function that compares two rows of the same table
indentfied with their row numbers.
*/
- static int cmp_rows_by_rownum(Ordered_key *key, ha_rows* a, ha_rows* b);
+ int cmp_keys_by_row_data(rownum_t a, rownum_t b);
+ static int cmp_keys_by_row_data_and_rownum(Ordered_key *key,
+ rownum_t* a, rownum_t* b);
- int compare_row_with_key(ha_rows row_num);
+ int cmp_key_with_search_key(rownum_t row_num);
public:
+ static void *operator new(size_t size) throw ()
+ { return sql_alloc(size); }
Ordered_key(uint key_idx_arg, TABLE *tbl_arg,
Item *search_key_arg, ha_rows null_count_arg,
ha_rows min_null_row_arg, ha_rows max_null_row_arg,
uchar *row_num_to_rowid_arg);
+ ~Ordered_key();
+ void cleanup();
/* Initialize a multi-column index. */
bool init(MY_BITMAP *columns_to_index);
/* Initialize a single-column index. */
@@ -750,10 +787,21 @@ public:
uint get_column_count() { return key_column_count; }
uint get_key_idx() { return key_idx; }
- void add_key(ha_rows row_num)
+ uint get_field_idx(uint i)
+ {
+ DBUG_ASSERT(i < key_column_count);
+ return key_columns[i]->field->field_index;
+ }
+ Item *get_search_key(uint i)
{
- row_index[cur_row]= row_num;
- ++cur_row;
+ return search_key->element_index(key_columns[i]->field->field_index);
+ }
+ void add_key(rownum_t row_num)
+ {
+ /* The caller must know how many elements to add. */
+ DBUG_ASSERT(key_buff_elements && cur_key_idx < key_buff_elements);
+ key_buff[cur_key_idx]= row_num;
+ ++cur_key_idx;
}
void sort_keys();
@@ -766,28 +814,38 @@ public:
this->search_key.
*/
bool lookup();
- /* Return the current index element. */
- ha_rows current() { return row_index[cur_row]; }
- /* Move the current index cursor at the next match. */
+ /* Move the current index cursor to the first key. */
+ void first()
+ {
+ DBUG_ASSERT(key_buff_elements);
+ cur_key_idx= 0;
+ }
+ /* TODO */
+ bool next_same();
+ /* Move the current index cursor to the next key. */
bool next()
{
- if (cur_row < tbl->file->stats.records)
+ DBUG_ASSERT(key_buff_elements);
+ if (cur_key_idx < key_buff_elements - 1)
{
- ++cur_row;
+ ++cur_key_idx;
return TRUE;
}
return FALSE;
};
- /* Return false if all matches are exhausted, true otherwise. */
- bool is_eof() { return cur_row == tbl->file->stats.records; }
+ /* Return the current index element. */
+ rownum_t current()
+ {
+ DBUG_ASSERT(key_buff_elements && cur_key_idx < key_buff_elements);
+ return key_buff[cur_key_idx];
+ }
- void set_null(ha_rows row_num)
+ void set_null(rownum_t row_num)
{
bitmap_set_bit(&null_key, row_num);
}
- bool is_null(ha_rows row_num)
+ bool is_null(rownum_t row_num)
{
- DBUG_ENTER("Ordered_key::is_null");
/*
Indexes consisting of only NULLs do not have a bitmap buffer at all.
Their only initialized member is 'n_bits', which is equal to the number
@@ -796,11 +854,11 @@ public:
if (null_count == tbl->file->stats.records)
{
DBUG_ASSERT(tbl->file->stats.records == null_key.n_bits);
- DBUG_RETURN(TRUE);
+ return TRUE;
}
if (row_num > max_null_row || row_num < min_null_row)
- DBUG_RETURN(FALSE);
- DBUG_RETURN(bitmap_is_set(&null_key, row_num));
+ return FALSE;
+ return bitmap_is_set(&null_key, row_num);
}
};
@@ -815,18 +873,28 @@ protected:
TRUE, then subselect_rowid_merge_engine further distinguishes between
FALSE and UNKNOWN.
*/
- subselect_engine *lookup_engine;
+ subselect_uniquesubquery_engine *lookup_engine;
/*
- Mapping from row numbers to row ids. The element row_num_to_rowid[i]
- contains a buffer with the rowid for the row numbered 'i'.
+ Mapping from row numbers to row ids. The rowids are stored sequentially
+ in the array - rowid[i] is located in row_num_to_rowid + i * rowid_length.
*/
uchar *row_num_to_rowid;
/*
A subset of all the keys for which there is a match for the same row.
- Used during execution. Computed for each call to exec().
+ Used during execution. Computed for each outer reference
*/
MY_BITMAP matching_keys;
/*
+ The columns of the outer reference that are NULL. Computed for each
+ outer reference.
+ */
+ MY_BITMAP matching_outer_cols;
+ /*
+ Columns that consist of only NULLs. Such columns match any value.
+ Computed once per query execution.
+ */
+ MY_BITMAP null_only_columns;
+ /*
Indexes of row numbers, sorted by <column_value, row_number>. If an
index may contain NULLs, the NULLs are stored efficiently in a bitmap.
@@ -849,44 +917,59 @@ protected:
This queue is used by the partial match algorithm in method exec().
*/
QUEUE pq;
+ /* True if there is a NULL (sub)row that covers all NULLable columns. */
+ bool has_covering_null_row;
protected:
/*
Comparison function to compare keys in order of increasing bitmap
selectivity.
*/
- static int cmp_key_by_null_selectivity(Ordered_key *a, Ordered_key *b);
+ static int cmp_keys_by_null_selectivity(Ordered_key *a, Ordered_key *b);
/*
Comparison function used by the priority queue pq, the 'smaller' key
is the one with the smaller current row number.
*/
- static int cmp_key_by_cur_row(void *arg, uchar *k1, uchar *k2);
+ static int cmp_keys_by_cur_rownum(void *arg, uchar *k1, uchar *k2);
- bool test_null_row(ha_rows row_num);
+ bool test_null_row(rownum_t row_num);
bool partial_match();
public:
- subselect_rowid_merge_engine(subselect_engine *lookup_engine_arg,
+ subselect_rowid_merge_engine(subselect_uniquesubquery_engine *engine_arg,
TABLE *tmp_table_arg, uint keys_count_arg,
+ uint has_covering_null_row_arg,
Item_subselect *item_arg,
select_result_interceptor *result_arg)
:subselect_engine(item_arg, result_arg),
- tmp_table(tmp_table_arg), lookup_engine(lookup_engine_arg),
- keys_count(keys_count_arg)
- {}
-
+ tmp_table(tmp_table_arg), lookup_engine(engine_arg),
+ keys_count(keys_count_arg), non_null_key(NULL),
+ has_covering_null_row(has_covering_null_row_arg)
+ {
+ thd= lookup_engine->get_thd();
+ }
+ ~subselect_rowid_merge_engine();
bool init(MY_BITMAP *non_null_key_parts, MY_BITMAP *partial_match_key_parts);
void cleanup();
int prepare() { return 0; }
void fix_length_and_dec(Item_cache**) {}
int exec();
- uint cols() { return 0; }
+ uint cols() { /* TODO: what is the correct value? */ return 1; }
uint8 uncacheable() { return UNCACHEABLE_DEPENDENT; }
void exclude() {}
table_map upper_select_const_tables() { return 0; }
void print(String*, enum_query_type) {}
bool change_result(Item_subselect*, select_result_interceptor*)
- { return false; }
+ { DBUG_ASSERT(FALSE); return false; }
bool no_tables() { return false; }
- bool no_rows() {return false; }
+ bool no_rows()
+ {
+ /*
+ TODO: It is completely unclear what is the semantics of this
+ method. The current result is computed so that the call to no_rows()
+ from Item_in_optimizer::val_int() sets Item_in_optimizer::null_value
+ correctly.
+ */
+ return !(((Item_in_subselect *) item)->null_value);
+ }
};
@@ -933,6 +1016,7 @@ protected:
/* Keyparts of the single column indexes with NULL, one keypart per index. */
MY_BITMAP partial_match_key_parts;
uint count_partial_match_columns;
+ uint count_null_only_columns;
/*
A conjunction of all the equality condtions between all pairs of expressions
that are arguments of an IN predicate. We need these to post-filter some
@@ -962,7 +1046,7 @@ public:
:subselect_engine(in_predicate, NULL), tmp_table(NULL),
is_materialized(FALSE), materialize_engine(old_engine), lookup_engine(NULL),
materialize_join(NULL), count_partial_match_columns(0),
- semi_join_conds(NULL)
+ count_null_only_columns(0), semi_join_conds(NULL)
{
set_thd(thd);
}
=== modified file 'sql/sql_class.cc'
--- a/sql/sql_class.cc 2010-01-22 16:18:05 +0000
+++ b/sql/sql_class.cc 2010-02-12 14:33:43 +0000
@@ -2931,12 +2931,13 @@ create_result_table(THD *thd_arg, List<I
options, HA_POS_ERROR, (char*) table_alias)))
return TRUE;
- /* TODO: if/where/when to free this buffer? */
- col_stat= (Column_statistics*) table->in_use->calloc(table->s->fields *
- sizeof(Column_statistics));
+ col_stat= (Column_statistics*) table->in_use->alloc(table->s->fields *
+ sizeof(Column_statistics));
if (!stat)
return TRUE;
+ cleanup();
+
table->file->extra(HA_EXTRA_WRITE_CACHE);
table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
return FALSE;
@@ -2966,14 +2967,14 @@ bool select_materialize_with_stats::send
{
++cur_col_stat->null_count;
cur_col_stat->max_null_row= count_rows;
- if (cur_col_stat->min_null_row == 0)
+ if (!cur_col_stat->min_null_row)
cur_col_stat->min_null_row= count_rows;
++nulls_in_row;
}
++cur_col_stat;
}
- if (nulls_in_row == items.elements)
- ++null_record_count;
+ if (nulls_in_row > max_nulls_in_row)
+ max_nulls_in_row= nulls_in_row;
return select_union::send_data(items);
}
=== modified file 'sql/sql_class.h'
--- a/sql/sql_class.h 2010-02-01 12:09:48 +0000
+++ b/sql/sql_class.h 2010-02-12 14:33:43 +0000
@@ -3044,17 +3044,20 @@ protected:
public:
/* Count of NULLs per column. */
ha_rows null_count;
- /* The row number that contains the last NULL in a column. */
- ha_rows max_null_row;
/* The row number that contains the first NULL in a column. */
ha_rows min_null_row;
+ /* The row number that contains the last NULL in a column. */
+ ha_rows max_null_row;
};
/* Array of statistics data per column. */
Column_statistics* col_stat;
- /* The number of rows that consist only of NULL values. */
- ha_rows null_record_count;
+ /*
+ The number of columns in the biggest sub-row that consists of only
+ NULL values.
+ */
+ ha_rows max_nulls_in_row;
/*
Count of rows writtent to the temp table. This is redundant as it is
already stored in handler::stats.records, however that one is relatively
@@ -3063,11 +3066,7 @@ protected:
ha_rows count_rows;
public:
- select_materialize_with_stats()
- {
- null_record_count= 0;
- count_rows= 0;
- }
+ select_materialize_with_stats() {}
virtual bool create_result_table(THD *thd, List<Item> *column_types,
bool is_distinct, ulonglong options,
const char *alias, bool bit_fields_as_long);
@@ -3075,9 +3074,9 @@ public:
bool send_data(List<Item> &items);
void cleanup()
{
- null_record_count= 0;
- count_rows= 0;
memset(col_stat, 0, table->s->fields * sizeof(Column_statistics));
+ max_nulls_in_row= 0;
+ count_rows= 0;
}
ha_rows get_null_count_of_col(uint idx)
{
@@ -3094,7 +3093,7 @@ public:
DBUG_ASSERT(idx < table->s->fields);
return col_stat[idx].min_null_row;
}
- ha_rows get_null_record_count() { return null_record_count; }
+ ha_rows get_max_nulls_in_row() { return max_nulls_in_row; }
};
=== modified file 'sql/sql_select.cc'
--- a/sql/sql_select.cc 2010-01-22 16:18:05 +0000
+++ b/sql/sql_select.cc 2010-02-12 14:33:43 +0000
@@ -707,7 +707,7 @@ JOIN::prepare(Item ***rref_pointer_array
subquery_types_allow_materialization(in_subs))
{
// psergey-todo: duplicated_subselect_card_check: where it's done?
- if (in_subs->is_top_level_item() && // 4
+ if (//in_subs->is_top_level_item() && // 4
!in_subs->is_correlated && // 5
in_subs->exec_method == Item_in_subselect::NOT_TRANSFORMED) // 6
in_subs->exec_method= Item_in_subselect::MATERIALIZATION;
1
0
[Maria-developers] [Branch ~maria-captains/maria/5.1] Rev 2817: Fix for LPBUG#516148 Test maria.maria3 fails when --without-maria-tmp-tables is set
by noreply@launchpad.net 12 Feb '10
by noreply@launchpad.net 12 Feb '10
12 Feb '10
------------------------------------------------------------
revno: 2817
committer: Michael Widenius <monty(a)askmonty.org>
branch nick: maria-5.1
timestamp: Fri 2010-02-12 16:21:13 +0200
message:
Fix for LPBUG#516148 Test maria.maria3 fails when --without-maria-tmp-tables is set
modified:
mysql-test/suite/maria/r/maria3.result
mysql-test/suite/maria/t/maria3.test
storage/maria/ha_maria.cc
--
lp:maria
https://code.launchpad.net/~maria-captains/maria/5.1
Your team Maria developers is subscribed to branch lp:maria.
To unsubscribe from this branch go to https://code.launchpad.net/~maria-captains/maria/5.1/+edit-subscription.
1
0
[Maria-developers] bzr commit into MariaDB 5.1, with Maria 1.5:maria branch (monty:2817)
by Michael Widenius 12 Feb '10
by Michael Widenius 12 Feb '10
12 Feb '10
#At lp:maria based on revid:monty@askmonty.org-20100211191524-rbd8pfcchi9ewm4a
2817 Michael Widenius 2010-02-12
Fix for LPBUG#516148 Test maria.maria3 fails when --without-maria-tmp-tables is set
modified:
mysql-test/suite/maria/r/maria3.result
mysql-test/suite/maria/t/maria3.test
storage/maria/ha_maria.cc
per-file messages:
mysql-test/suite/maria/r/maria3.result
Updated test results
mysql-test/suite/maria/t/maria3.test
Don't show maria_used_for_temp_tables, as it's value is depending on configure options
=== modified file 'mysql-test/suite/maria/r/maria3.result'
--- a/mysql-test/suite/maria/r/maria3.result 2009-09-18 01:04:43 +0000
+++ b/mysql-test/suite/maria/r/maria3.result 2010-02-12 14:21:13 +0000
@@ -301,7 +301,7 @@ check table t1 extended;
Table Op Msg_type Msg_text
test.t1 check status OK
drop table t1;
-show variables like 'maria%';
+select lower(variable_name) as Variable_name, Variable_value as Value from information_schema.session_variables where variable_name like "maria%" and variable_name not like "maria_used_for_temp_tables" order by 1;
Variable_name Value
maria_block_size 8192
maria_checkpoint_interval 30
@@ -309,16 +309,15 @@ maria_force_start_after_recovery_failure
maria_log_file_size 4294959104
maria_log_purge_type immediate
maria_max_sort_file_size 9223372036853727232
-maria_page_checksum OFF
maria_pagecache_age_threshold 300
maria_pagecache_buffer_size 8384512
maria_pagecache_division_limit 100
+maria_page_checksum OFF
maria_recover OFF
maria_repair_threads 1
maria_sort_buffer_size 8388608
maria_stats_method nulls_unequal
maria_sync_log_dir NEWFILE
-maria_used_for_temp_tables ON
show status like 'maria%';
Variable_name Value
Maria_pagecache_blocks_not_flushed #
=== modified file 'mysql-test/suite/maria/t/maria3.test'
--- a/mysql-test/suite/maria/t/maria3.test 2009-06-02 09:58:27 +0000
+++ b/mysql-test/suite/maria/t/maria3.test 2010-02-12 14:21:13 +0000
@@ -259,7 +259,7 @@ drop table t1;
# Fix if we are using safemalloc
--replace_result 8388572 8388600
-show variables like 'maria%';
+select lower(variable_name) as Variable_name, Variable_value as Value from information_schema.session_variables where variable_name like "maria%" and variable_name not like "maria_used_for_temp_tables" order by 1;
--replace_column 2 #
show status like 'maria%';
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-02-10 19:06:24 +0000
+++ b/storage/maria/ha_maria.cc 2010-02-12 14:21:13 +0000
@@ -3278,11 +3278,11 @@ static struct st_mysql_sys_var* system_v
MYSQL_SYSVAR(block_size),
MYSQL_SYSVAR(checkpoint_interval),
MYSQL_SYSVAR(force_start_after_recovery_failures),
- MYSQL_SYSVAR(page_checksum),
MYSQL_SYSVAR(log_dir_path),
MYSQL_SYSVAR(log_file_size),
MYSQL_SYSVAR(log_purge_type),
MYSQL_SYSVAR(max_sort_file_size),
+ MYSQL_SYSVAR(page_checksum),
MYSQL_SYSVAR(pagecache_age_threshold),
MYSQL_SYSVAR(pagecache_buffer_size),
MYSQL_SYSVAR(pagecache_division_limit),
1
0
[Maria-developers] Rev 2741: Group commit for maria engine. in file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit2/
by sanja@askmonty.org 12 Feb '10
by sanja@askmonty.org 12 Feb '10
12 Feb '10
At file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit2/
------------------------------------------------------------
revno: 2741
revision-id: sanja(a)askmonty.org-20100212131228-bgxli0wfybhjkvg9
parent: sergii(a)pisem.net-20100212084731-b5jst7oxhzp251pg
committer: sanja(a)askmonty.org
branch nick: work-maria-5.2-groupcommit2
timestamp: Fri 2010-02-12 15:12:28 +0200
message:
Group commit for maria engine.
=== added file 'mysql-test/suite/maria/r/group_commit.result'
--- a/mysql-test/suite/maria/r/group_commit.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/maria/r/group_commit.result 2010-02-12 13:12:28 +0000
@@ -0,0 +1,17 @@
+drop table if exists t1;
+create table t1 (a int);
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+drop table t1;
=== modified file 'mysql-test/suite/maria/r/maria3.result'
--- a/mysql-test/suite/maria/r/maria3.result 2009-09-18 01:04:43 +0000
+++ b/mysql-test/suite/maria/r/maria3.result 2010-02-12 13:12:28 +0000
@@ -306,6 +306,8 @@
maria_block_size 8192
maria_checkpoint_interval 30
maria_force_start_after_recovery_failures 0
+maria_group_commit none
+maria_group_commit_interval 0
maria_log_file_size 4294959104
maria_log_purge_type immediate
maria_max_sort_file_size 9223372036853727232
@@ -328,6 +330,7 @@
Maria_pagecache_reads #
Maria_pagecache_write_requests #
Maria_pagecache_writes #
+Maria_transaction_log_syncs #
create table t1 (b char(0));
insert into t1 values(NULL),("");
select length(b) from t1;
=== added file 'mysql-test/suite/maria/t/group_commit.test'
--- a/mysql-test/suite/maria/t/group_commit.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/maria/t/group_commit.test 2010-02-12 13:12:28 +0000
@@ -0,0 +1,71 @@
+# Test different ways of syncing (mostly syntax)
+
+--disable_warnings
+drop table if exists t1;
+--enable_warnings
+
+create table t1 (a int);
+
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+drop table t1;
=== added directory 'randgen'
=== added directory 'randgen/conf'
=== added file 'randgen/conf/maria_group_commit.yy'
--- a/randgen/conf/maria_group_commit.yy 1970-01-01 00:00:00 +0000
+++ b/randgen/conf/maria_group_commit.yy 2010-02-12 13:12:28 +0000
@@ -0,0 +1,181 @@
+# test of group commit switching
+
+query:
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ change_group_commit | change_interval;
+
+
+select:
+ SELECT select_item FROM join where order_by limit;
+
+select_item:
+ * | X . _field ;
+
+join:
+ _table AS X |
+ _table AS X LEFT JOIN _table AS Y ON ( X . _field = Y . _field ) ;
+
+where:
+ |
+ WHERE X . _field < value |
+ WHERE X . _field > value |
+ WHERE X . _field = value ;
+
+where_delete:
+ |
+ WHERE _field < value |
+ WHERE _field > value |
+ WHERE _field = value ;
+
+order_by:
+ | ORDER BY X . _field ;
+
+limit:
+ | LIMIT _digit ;
+
+insert:
+ INSERT INTO _table ( _field , _field ) VALUES ( value , value ) ;
+
+update:
+ UPDATE _table AS X SET _field = value where order_by limit ;
+
+delete:
+ DELETE FROM _table where_delete LIMIT _digit ;
+
+value:
+ ' _letter ' | _digit | _date | _datetime | _time | _english ;
+
+change_group_commit:
+ SET GLOBAL MARIA_GROUP_COMMIT=none_soft_hard;
+
+none_soft_hard:
+ NONE | SOFT | HARD;
+
+change_interval:
+ set_interval | set_interval | set_interval | set_interval |
+ drop_interval;
+
+set_interval:
+ SET GLOBAL MARIA_GROUP_COMMIT_INTERVAL=_tinyint_unsigned;
+
+drop_interval:
+ SET GLOBAL MARIA_GROUP_COMMIT_INTERVAL=0;
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2010-02-10 19:06:24 +0000
+++ b/storage/maria/ha_maria.cc 2010-02-12 13:12:28 +0000
@@ -102,22 +102,40 @@
array_elements(maria_translog_purge_type_names) - 1, "",
maria_translog_purge_type_names, NULL
};
+
+/* transactional log directory sync */
const char *maria_sync_log_dir_names[]=
{
"NEVER", "NEWFILE", "ALWAYS", NullS
};
-
TYPELIB maria_sync_log_dir_typelib=
{
array_elements(maria_sync_log_dir_names) - 1, "",
maria_sync_log_dir_names, NULL
};
+/* transactional log group commit */
+const char *maria_group_commit_names[]=
+{
+ "none", "hard", "soft", NullS
+};
+TYPELIB maria_group_commit_typelib=
+{
+ array_elements(maria_group_commit_names) - 1, "",
+ maria_group_commit_names, NULL
+};
+
/** Interval between background checkpoints in seconds */
static ulong checkpoint_interval;
static void update_checkpoint_interval(MYSQL_THD thd,
struct st_mysql_sys_var *var,
void *var_ptr, const void *save);
+static void update_maria_group_commit(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save);
+static void update_maria_group_commit_interval(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save);
/** After that many consecutive recovery failures, remove logs */
static ulong force_start_after_recovery_failures;
static void update_log_file_size(MYSQL_THD thd,
@@ -164,6 +182,24 @@
NULL, update_log_file_size, TRANSLOG_FILE_SIZE,
TRANSLOG_MIN_FILE_SIZE, 0xffffffffL, TRANSLOG_PAGE_SIZE);
+static MYSQL_SYSVAR_ENUM(group_commit, maria_group_commit,
+ PLUGIN_VAR_RQCMDARG,
+ "Specifies maria group commit mode. "
+ "Possible values are \"none\" (no group commit), "
+ "\"hard\" (with waiting to actual commit), "
+ "\"soft\" (no wait for commit (DANGEROUS!!!))",
+ NULL, update_maria_group_commit,
+ TRANSLOG_GCOMMIT_NONE, &maria_group_commit_typelib);
+
+static MYSQL_SYSVAR_ULONG(group_commit_interval, maria_group_commit_interval,
+ PLUGIN_VAR_RQCMDARG,
+ "Interval between commite in microseconds (1/1000000c)."
+ " 0 stands for no waiting"
+ " for other threads to come and do a commit in \"hard\" mode and no"
+ " sync()/commit at all in \"soft\" mode. Option has only an effect"
+ " if maria_group_commit is used",
+ NULL, update_maria_group_commit_interval, 0, 0, UINT_MAX, 1);
+
static MYSQL_SYSVAR_ENUM(log_purge_type, log_purge_type,
PLUGIN_VAR_RQCMDARG,
"Specifies how maria transactional log will be purged. "
@@ -3278,6 +3314,8 @@
MYSQL_SYSVAR(block_size),
MYSQL_SYSVAR(checkpoint_interval),
MYSQL_SYSVAR(force_start_after_recovery_failures),
+ MYSQL_SYSVAR(group_commit),
+ MYSQL_SYSVAR(group_commit_interval),
MYSQL_SYSVAR(page_checksum),
MYSQL_SYSVAR(log_dir_path),
MYSQL_SYSVAR(log_file_size),
@@ -3309,6 +3347,92 @@
}
/**
+ @brief Updates group commit mode
+*/
+
+static void update_maria_group_commit(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save)
+{
+ ulong value= (ulong)*((long *)var_ptr);
+ DBUG_ENTER("update_maria_group_commit");
+ DBUG_PRINT("enter", ("old value: %lu new value %lu rate %lu",
+ value, (ulong)(*(long *)save),
+ maria_group_commit_interval));
+ /* old value */
+ switch (value) {
+ case TRANSLOG_GCOMMIT_NONE:
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ translog_hard_group_commit(FALSE);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ translog_soft_sync(FALSE);
+ if (maria_group_commit_interval)
+ translog_soft_sync_end();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ value= *(ulong *)var_ptr= (ulong)(*(long *)save);
+ translog_sync();
+ /* new value */
+ switch (value) {
+ case TRANSLOG_GCOMMIT_NONE:
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ translog_hard_group_commit(TRUE);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ translog_soft_sync(TRUE);
+ /* variable change made under global lock so we can just read it */
+ if (maria_group_commit_interval)
+ translog_soft_sync_start();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ DBUG_VOID_RETURN;
+}
+
+/**
+ @brief Updates group commit interval
+*/
+
+static void update_maria_group_commit_interval(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save)
+{
+ ulong new_value= (ulong)*((long *)save);
+ ulong *value_ptr= (ulong*) var_ptr;
+ DBUG_ENTER("update_maria_group_commit_interval");
+ DBUG_PRINT("enter", ("old value: %lu new value %lu group commit %lu",
+ *value_ptr, new_value, maria_group_commit));
+
+ /* variable change made under global lock so we can just read it */
+ switch (maria_group_commit) {
+ case TRANSLOG_GCOMMIT_NONE:
+ *value_ptr= new_value;
+ translog_set_group_commit_interval(new_value);
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ *value_ptr= new_value;
+ translog_set_group_commit_interval(new_value);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ if (*value_ptr)
+ translog_soft_sync_end();
+ translog_set_group_commit_interval(new_value);
+ if ((*value_ptr= new_value))
+ translog_soft_sync_start();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ DBUG_VOID_RETURN;
+}
+
+/**
@brief Updates the transaction log file limit.
*/
@@ -3330,6 +3454,7 @@
{"Maria_pagecache_reads", (char*) &maria_pagecache_var.global_cache_read, SHOW_LONGLONG},
{"Maria_pagecache_write_requests", (char*) &maria_pagecache_var.global_cache_w_requests, SHOW_LONGLONG},
{"Maria_pagecache_writes", (char*) &maria_pagecache_var.global_cache_write, SHOW_LONGLONG},
+ {"Maria_transaction_log_syncs", (char*) &translog_syncs, SHOW_LONGLONG},
{NullS, NullS, SHOW_LONG}
};
=== modified file 'storage/maria/ma_init.c'
--- a/storage/maria/ma_init.c 2008-10-09 20:03:54 +0000
+++ b/storage/maria/ma_init.c 2010-02-12 13:12:28 +0000
@@ -82,6 +82,11 @@
maria_inited= maria_multi_threaded= FALSE;
ft_free_stopwords();
ma_checkpoint_end();
+ if (translog_status == TRANSLOG_OK)
+ {
+ translog_soft_sync_end();
+ translog_sync();
+ }
if ((trid= trnman_get_max_trid()) > max_trid_in_control_file)
{
/*
=== modified file 'storage/maria/ma_loghandler.c'
--- a/storage/maria/ma_loghandler.c 2010-01-06 21:27:53 +0000
+++ b/storage/maria/ma_loghandler.c 2010-02-12 13:12:28 +0000
@@ -18,6 +18,7 @@
#include "ma_blockrec.h" /* for some constants and in-write hooks */
#include "ma_key_recover.h" /* For some in-write hooks */
#include "ma_checkpoint.h"
+#include "ma_servicethread.h"
/*
On Windows, neither my_open() nor my_sync() work for directories.
@@ -47,6 +48,15 @@
#include <m_ctype.h>
#endif
+/** @brief protects checkpoint_in_progress */
+static pthread_mutex_t LOCK_soft_sync;
+/** @brief for killing the background checkpoint thread */
+static pthread_cond_t COND_soft_sync;
+/** @brief control structure for checkpoint background thread */
+static MA_SERVICE_THREAD_CONTROL soft_sync_control=
+ {THREAD_DEAD, FALSE, &LOCK_soft_sync, &COND_soft_sync};
+
+
/* transaction log file descriptor */
typedef struct st_translog_file
{
@@ -124,10 +134,24 @@
/* Previous buffer offset to detect it flush finish */
TRANSLOG_ADDRESS prev_buffer_offset;
/*
+ If the buffer was forced to close it save value of its horizon
+ otherwise LSN_IMPOSSIBLE
+ */
+ TRANSLOG_ADDRESS pre_force_close_horizon;
+ /*
How much is written (or will be written when copy_to_buffer_in_progress
become 0) to this buffer
*/
translog_size_t size;
+ /*
+ When moving from one log buffer to another, we write the last of the
+ previous buffer to file and then move to start using the new log
+ buffer. In the case of a part filed last page, this page is not moved
+ to the start of the new buffer but instead we set the 'skip_data'
+ variable to tell us how much data at the beginning of the buffer is not
+ relevant.
+ */
+ uint skipped_data;
/* File handler for this buffer */
TRANSLOG_FILE *file;
/* Threads which are waiting for buffer filling/freeing */
@@ -304,6 +328,7 @@
*/
pthread_mutex_t log_flush_lock;
pthread_cond_t log_flush_cond;
+ pthread_cond_t new_goal_cond;
/* Protects changing of headers of finished files (max_lsn) */
pthread_mutex_t file_header_lock;
@@ -344,13 +369,39 @@
ulong log_purge_type= TRANSLOG_PURGE_IMMIDIATE;
ulong log_file_size= TRANSLOG_FILE_SIZE;
+/* sync() of log files directory mode */
ulong sync_log_dir= TRANSLOG_SYNC_DIR_NEWFILE;
+ulong maria_group_commit= TRANSLOG_GCOMMIT_NONE;
+ulong maria_group_commit_interval= 0;
/* Marker for end of log */
static uchar end_of_log= 0;
#define END_OF_LOG &end_of_log
+/**
+ Switch for "soft" sync (no real sync() but periodical sync by service
+ thread)
+*/
+static volatile my_bool soft_sync= FALSE;
+/**
+ Switch for "hard" group commit mode
+*/
+static volatile my_bool hard_group_commit= FALSE;
+/**
+ File numbers interval which have to be sync()
+*/
+static uint32 soft_sync_min= 0;
+static uint32 soft_sync_max= 0;
+static uint32 soft_need_sync= 1;
+/**
+ stores interval in microseconds
+*/
+static uint32 group_commit_wait= 0;
enum enum_translog_status translog_status= TRANSLOG_UNINITED;
+ulonglong translog_syncs= 0; /* Number of sync()s */
+
+/* time of last flush */
+static ulonglong flush_start= 0;
/* chunk types */
#define TRANSLOG_CHUNK_LSN 0x00 /* 0 chunk refer as LSN (head or tail */
@@ -980,12 +1031,17 @@
static TRANSLOG_FILE *get_current_logfile()
{
TRANSLOG_FILE *file;
+ DBUG_ENTER("get_current_logfile");
rw_rdlock(&log_descriptor.open_files_lock);
+ DBUG_PRINT("info", ("max_file: %lu min_file: %lu open_files: %lu",
+ (ulong) log_descriptor.max_file,
+ (ulong) log_descriptor.min_file,
+ (ulong) log_descriptor.open_files.elements));
DBUG_ASSERT(log_descriptor.max_file - log_descriptor.min_file + 1 ==
log_descriptor.open_files.elements);
file= *dynamic_element(&log_descriptor.open_files, 0, TRANSLOG_FILE **);
rw_unlock(&log_descriptor.open_files_lock);
- return (file);
+ DBUG_RETURN(file);
}
uchar NEAR maria_trans_file_magic[]=
@@ -1069,6 +1125,7 @@
static my_bool translog_max_lsn_to_header(File file, LSN lsn)
{
uchar lsn_buff[LSN_STORE_SIZE];
+ my_bool rc;
DBUG_ENTER("translog_max_lsn_to_header");
DBUG_PRINT("enter", ("File descriptor: %ld "
"lsn: (%lu,0x%lx)",
@@ -1077,11 +1134,17 @@
lsn_store(lsn_buff, lsn);
- DBUG_RETURN(my_pwrite(file, lsn_buff,
- LSN_STORE_SIZE,
- (LOG_HEADER_DATA_SIZE - LSN_STORE_SIZE),
- log_write_flags) != 0 ||
- my_sync(file, MYF(MY_WME)) != 0);
+ rc= (my_pwrite(file, lsn_buff,
+ LSN_STORE_SIZE,
+ (LOG_HEADER_DATA_SIZE - LSN_STORE_SIZE),
+ log_write_flags) != 0 ||
+ my_sync(file, MYF(MY_WME)) != 0);
+ /*
+ We should not increase counter in case of error above, but it is so
+ unlikely that we can ignore this case
+ */
+ translog_syncs++;
+ DBUG_RETURN(rc);
}
@@ -1423,7 +1486,9 @@
static my_bool translog_buffer_init(struct st_translog_buffer *buffer, int num)
{
DBUG_ENTER("translog_buffer_init");
- buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
+ buffer->pre_force_close_horizon=
+ buffer->prev_last_lsn= buffer->last_lsn=
+ LSN_IMPOSSIBLE;
DBUG_PRINT("info", ("last_lsn and prev_last_lsn set to 0 buffer: 0x%lx",
(ulong) buffer));
@@ -1435,6 +1500,7 @@
memset(buffer->buffer, TRANSLOG_FILLER, TRANSLOG_WRITE_BUFFER);
/* Buffer size */
buffer->size= 0;
+ buffer->skipped_data= 0;
/* cond of thread which is waiting for buffer filling */
if (pthread_cond_init(&buffer->waiting_filling_buffer, 0))
DBUG_RETURN(1);
@@ -1489,7 +1555,10 @@
TODO: sync only we have changed the log
*/
if (!file->is_sync)
+ {
rc= my_sync(file->handler.file, MYF(MY_WME));
+ translog_syncs++;
+ }
rc|= my_close(file->handler.file, MYF(MY_WME));
my_free(file, MYF(0));
return test(rc);
@@ -2044,7 +2113,8 @@
(ulong) LSN_OFFSET(log_descriptor.horizon),
(ulong) LSN_OFFSET(log_descriptor.horizon)));
DBUG_ASSERT(buffer_no == buffer->buffer_no);
- buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
+ buffer->pre_force_close_horizon=
+ buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
DBUG_PRINT("info", ("last_lsn and prev_last_lsn set to 0 buffer: 0x%lx",
(ulong) buffer));
buffer->offset= log_descriptor.horizon;
@@ -2052,6 +2122,7 @@
buffer->file= get_current_logfile();
buffer->overlay= 0;
buffer->size= 0;
+ buffer->skipped_data= 0;
translog_cursor_init(cursor, buffer, buffer_no);
DBUG_PRINT("info", ("file: #%ld (%d) init cursor #%u: 0x%lx "
"chaser: %d Size: %lu (%lu)",
@@ -2523,6 +2594,7 @@
TRANSLOG_ADDRESS offset= buffer->offset;
TRANSLOG_FILE *file= buffer->file;
uint8 ver= buffer->ver;
+ uint skipped_data;
DBUG_ENTER("translog_buffer_flush");
DBUG_PRINT("enter",
("Buffer: #%u 0x%lx file: %d offset: (%lu,0x%lx) size: %lu",
@@ -2557,6 +2629,8 @@
disk
*/
file= buffer->file;
+ skipped_data= buffer->skipped_data;
+ DBUG_ASSERT(skipped_data < TRANSLOG_PAGE_SIZE);
for (i= 0, pg= LSN_OFFSET(buffer->offset) / TRANSLOG_PAGE_SIZE;
i < buffer->size;
i+= TRANSLOG_PAGE_SIZE, pg++)
@@ -2573,13 +2647,16 @@
DBUG_ASSERT(i + TRANSLOG_PAGE_SIZE <= buffer->size);
if (translog_status != TRANSLOG_OK && translog_status != TRANSLOG_SHUTDOWN)
DBUG_RETURN(1);
- if (pagecache_inject(log_descriptor.pagecache,
+ if (pagecache_write_part(log_descriptor.pagecache,
&file->handler, pg, 3,
buffer->buffer + i,
PAGECACHE_PLAIN_PAGE,
PAGECACHE_LOCK_LEFT_UNLOCKED,
- PAGECACHE_PIN_LEFT_UNPINNED, 0,
- LSN_IMPOSSIBLE))
+ PAGECACHE_PIN_LEFT_UNPINNED,
+ PAGECACHE_WRITE_DONE, 0,
+ LSN_IMPOSSIBLE,
+ skipped_data,
+ TRANSLOG_PAGE_SIZE - skipped_data))
{
DBUG_PRINT("error",
("Can't write page (%lu,0x%lx) to pagecache, error: %d",
@@ -2589,10 +2666,12 @@
translog_stop_writing();
DBUG_RETURN(1);
}
+ skipped_data= 0;
}
file->is_sync= 0;
- if (my_pwrite(file->handler.file, buffer->buffer,
- buffer->size, LSN_OFFSET(buffer->offset),
+ if (my_pwrite(file->handler.file, buffer->buffer + buffer->skipped_data,
+ buffer->size - buffer->skipped_data,
+ LSN_OFFSET(buffer->offset) + buffer->skipped_data,
log_write_flags))
{
DBUG_PRINT("error", ("Can't write buffer (%lu,0x%lx) size %lu "
@@ -2985,6 +3064,7 @@
uchar *from, *table= NULL;
int is_last_unfinished_page;
uint last_protected_sector= 0;
+ uint skipped_data= curr_buffer->skipped_data;
TRANSLOG_FILE file_copy;
uint8 ver= curr_buffer->ver;
translog_wait_for_writers(curr_buffer);
@@ -2997,7 +3077,38 @@
}
DBUG_ASSERT(LSN_FILE_NO(addr) == LSN_FILE_NO(curr_buffer->offset));
from= curr_buffer->buffer + (addr - curr_buffer->offset);
- memcpy(buffer, from, TRANSLOG_PAGE_SIZE);
+ if (skipped_data && addr == curr_buffer->offset)
+ {
+ /*
+ We read page part of which is not present in buffer,
+ so we should read absent part from file (page cache actually)
+ */
+ file= get_logfile_by_number(file_no);
+ DBUG_ASSERT(file != NULL);
+ /*
+ it's ok to not lock the page because:
+ - The log handler has it's own page cache.
+ - There is only one thread that can access the log
+ cache at a time
+ */
+ if (!(buffer= pagecache_read(log_descriptor.pagecache,
+ &file->handler,
+ LSN_OFFSET(addr) / TRANSLOG_PAGE_SIZE,
+ 3, buffer,
+ PAGECACHE_PLAIN_PAGE,
+ PAGECACHE_LOCK_LEFT_UNLOCKED,
+ NULL)))
+ DBUG_RETURN(NULL);
+ }
+ else
+ skipped_data= 0; /* Read after skipped in buffer data */
+ /*
+ Now we have correct data in buffer up to 'skipped_data'. The
+ following memcpy() will move the data from the internal buffer
+ that was not yet on disk.
+ */
+ memcpy(buffer + skipped_data, from + skipped_data,
+ TRANSLOG_PAGE_SIZE - skipped_data);
/*
We can use copy then in translog_page_validator() because it
do not put it permanently somewhere.
@@ -3291,6 +3402,7 @@
uint32 next_page_offset, page_rest;
uint32 i;
File fd;
+ int rc;
TRANSLOG_VALIDATOR_DATA data;
char path[FN_REFLEN];
uchar page_buff[TRANSLOG_PAGE_SIZE];
@@ -3316,14 +3428,19 @@
TRANSLOG_PAGE_SIZE);
page_rest= next_page_offset - LSN_OFFSET(addr);
memset(page_buff, TRANSLOG_FILLER, page_rest);
- if ((fd= open_logfile_by_number_no_cache(LSN_FILE_NO(addr))) < 0 ||
- ((my_chsize(fd, next_page_offset, TRANSLOG_FILLER, MYF(MY_WME)) ||
- (page_rest && my_pwrite(fd, page_buff, page_rest, LSN_OFFSET(addr),
- log_write_flags)) ||
- my_sync(fd, MYF(MY_WME))) |
- my_close(fd, MYF(MY_WME))) ||
- (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
- sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD))))
+ rc= ((fd= open_logfile_by_number_no_cache(LSN_FILE_NO(addr))) < 0 ||
+ ((my_chsize(fd, next_page_offset, TRANSLOG_FILLER, MYF(MY_WME)) ||
+ (page_rest && my_pwrite(fd, page_buff, page_rest, LSN_OFFSET(addr),
+ log_write_flags)) ||
+ my_sync(fd, MYF(MY_WME)))));
+ translog_syncs++;
+ rc|= (fd > 0 && my_close(fd, MYF(MY_WME)));
+ if (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS)
+ {
+ rc|= sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD));
+ translog_syncs++;
+ }
+ if (rc)
DBUG_RETURN(1);
/* fix the horizon */
@@ -3483,7 +3600,10 @@
my_bool version_changed= 0;
DBUG_ENTER("translog_init_with_table");
+ translog_syncs= 0;
+ flush_start= 0;
id_to_share= NULL;
+
log_descriptor.directory_fd= -1;
log_descriptor.is_everything_flushed= 1;
log_descriptor.flush_in_progress= 0;
@@ -3511,6 +3631,7 @@
pthread_mutex_init(&log_descriptor.dirty_buffer_mask_lock,
MY_MUTEX_INIT_FAST) ||
pthread_cond_init(&log_descriptor.log_flush_cond, 0) ||
+ pthread_cond_init(&log_descriptor.new_goal_cond, 0) ||
my_rwlock_init(&log_descriptor.open_files_lock,
NULL) ||
my_init_dynamic_array(&log_descriptor.open_files,
@@ -3912,7 +4033,6 @@
log_descriptor.flushed= log_descriptor.horizon;
log_descriptor.in_buffers_only= log_descriptor.bc.buffer->offset;
log_descriptor.max_lsn= LSN_IMPOSSIBLE; /* set to 0 */
- log_descriptor.previous_flush_horizon= log_descriptor.horizon;
/*
Now 'flushed' is set to 'horizon' value, but 'horizon' is (potentially)
address of the next LSN and we want indicate that all LSNs that are
@@ -3995,6 +4115,10 @@
It is beginning of the log => there is no LSNs in the log =>
There is no harm in leaving it "as-is".
*/
+ log_descriptor.previous_flush_horizon= log_descriptor.horizon;
+ DBUG_PRINT("info", ("previous_flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.
+ previous_flush_horizon)));
DBUG_RETURN(0);
}
file_no--;
@@ -4070,6 +4194,9 @@
translog_free_record_header(&rec);
}
}
+ log_descriptor.previous_flush_horizon= log_descriptor.horizon;
+ DBUG_PRINT("info", ("previous_flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.previous_flush_horizon)));
DBUG_RETURN(0);
err:
ma_message_no_user(0, "log initialization failed");
@@ -4157,6 +4284,7 @@
pthread_mutex_destroy(&log_descriptor.log_flush_lock);
pthread_mutex_destroy(&log_descriptor.dirty_buffer_mask_lock);
pthread_cond_destroy(&log_descriptor.log_flush_cond);
+ pthread_cond_destroy(&log_descriptor.new_goal_cond);
rwlock_destroy(&log_descriptor.open_files_lock);
delete_dynamic(&log_descriptor.open_files);
delete_dynamic(&log_descriptor.unfinished_files);
@@ -6885,11 +7013,11 @@
{
translog_size_t res;
DBUG_ENTER("translog_read_record_header_from_buffer");
- DBUG_ASSERT(translog_is_LSN_chunk(page[page_offset]));
- DBUG_ASSERT(translog_status == TRANSLOG_OK ||
- translog_status == TRANSLOG_READONLY);
DBUG_PRINT("info", ("page byte: 0x%x offset: %u",
(uint) page[page_offset], (uint) page_offset));
+ DBUG_ASSERT(translog_is_LSN_chunk(page[page_offset]));
+ DBUG_ASSERT(translog_status == TRANSLOG_OK ||
+ translog_status == TRANSLOG_READONLY);
buff->type= (page[page_offset] & TRANSLOG_REC_TYPE);
buff->short_trid= uint2korr(page + page_offset + 1);
DBUG_PRINT("info", ("Type %u, Short TrID %u, LSN (%lu,0x%lx)",
@@ -7356,27 +7484,27 @@
"Buffer addr: (%lu,0x%lx) "
"Page addr: (%lu,0x%lx) "
"size: %lu (%lu) Pg: %u left: %u in progress %u",
- (uint) log_descriptor.bc.buffer_no,
- (ulong) log_descriptor.bc.buffer,
- LSN_IN_PARTS(log_descriptor.bc.buffer->offset),
+ (uint) old_buffer_no,
+ (ulong) old_buffer,
+ LSN_IN_PARTS(old_buffer->offset),
(ulong) LSN_FILE_NO(log_descriptor.horizon),
(ulong) (LSN_OFFSET(log_descriptor.horizon) -
log_descriptor.bc.current_page_fill),
- (ulong) log_descriptor.bc.buffer->size,
+ (ulong) old_buffer->size,
(ulong) (log_descriptor.bc.ptr -log_descriptor.bc.
buffer->buffer),
(uint) log_descriptor.bc.current_page_fill,
(uint) left,
- (uint) log_descriptor.bc.buffer->
+ (uint) old_buffer->
copy_to_buffer_in_progress));
translog_lock_assert_owner();
LINT_INIT(current_page_fill);
- new_buff_beginning= log_descriptor.bc.buffer->offset;
- new_buff_beginning+= log_descriptor.bc.buffer->size; /* increase offset */
+ new_buff_beginning= old_buffer->offset;
+ new_buff_beginning+= old_buffer->size; /* increase offset */
DBUG_ASSERT(log_descriptor.bc.ptr !=NULL);
DBUG_ASSERT(LSN_FILE_NO(log_descriptor.horizon) ==
- LSN_FILE_NO(log_descriptor.bc.buffer->offset));
+ LSN_FILE_NO(old_buffer->offset));
translog_check_cursor(&log_descriptor.bc);
DBUG_ASSERT(left < TRANSLOG_PAGE_SIZE);
if (left)
@@ -7387,18 +7515,20 @@
*/
DBUG_PRINT("info", ("left: %u", (uint) left));
+ old_buffer->pre_force_close_horizon=
+ old_buffer->offset + old_buffer->size;
/* decrease offset */
new_buff_beginning-= log_descriptor.bc.current_page_fill;
current_page_fill= log_descriptor.bc.current_page_fill;
memset(log_descriptor.bc.ptr, TRANSLOG_FILLER, left);
- log_descriptor.bc.buffer->size+= left;
+ old_buffer->size+= left;
DBUG_PRINT("info", ("Finish Page buffer #%u: 0x%lx "
"Size: %lu",
- (uint) log_descriptor.bc.buffer->buffer_no,
- (ulong) log_descriptor.bc.buffer,
- (ulong) log_descriptor.bc.buffer->size));
- DBUG_ASSERT(log_descriptor.bc.buffer->buffer_no ==
+ (uint) old_buffer->buffer_no,
+ (ulong) old_buffer,
+ (ulong) old_buffer->size));
+ DBUG_ASSERT(old_buffer->buffer_no ==
log_descriptor.bc.buffer_no);
}
else
@@ -7509,11 +7639,21 @@
if (left)
{
- /*
- TODO: do not copy beginning of the page if we have no CRC or sector
- checks on
- */
- memcpy(new_buffer->buffer, data, current_page_fill);
+ if (log_descriptor.flags &
+ (TRANSLOG_PAGE_CRC | TRANSLOG_SECTOR_PROTECTION))
+ memcpy(new_buffer->buffer, data, current_page_fill);
+ else
+ {
+ /*
+ This page header does not change if we add more data to the page so
+ we can not copy it and will not overwrite later
+ */
+ new_buffer->skipped_data= current_page_fill;
+#ifndef DBUG_OFF
+ memset(new_buffer->buffer, 0xa5, current_page_fill);
+#endif
+ DBUG_ASSERT(new_buffer->skipped_data < TRANSLOG_PAGE_SIZE);
+ }
}
old_buffer->next_buffer_offset= new_buffer->offset;
translog_buffer_lock(new_buffer);
@@ -7561,6 +7701,7 @@
{
log_descriptor.next_pass_max_lsn= lsn;
log_descriptor.max_lsn_requester= pthread_self();
+ pthread_cond_broadcast(&log_descriptor.new_goal_cond);
}
while (flush_no == log_descriptor.flush_no)
{
@@ -7572,66 +7713,78 @@
/**
- @brief Flush the log up to given LSN (included)
-
- @param lsn log record serial number up to which (inclusive)
- the log has to be flushed
-
- @return Operation status
+ @brief sync() range of files (inclusive) and directory (by request)
+
+ @param min min internal file number to flush
+ @param max max internal file number to flush
+ @param sync_dir need sync directory
+
+ return Operation status
@retval 0 OK
@retval 1 Error
-
-*/
-
-my_bool translog_flush(TRANSLOG_ADDRESS lsn)
-{
- LSN sent_to_disk= LSN_IMPOSSIBLE;
- TRANSLOG_ADDRESS flush_horizon;
- uint fn, i;
+*/
+
+static my_bool translog_sync_files(uint32 min, uint32 max,
+ my_bool sync_dir)
+{
+ uint fn;
+ my_bool rc= 0;
+ ulonglong flush_interval;
+ DBUG_ENTER("translog_sync_files");
+ DBUG_PRINT("info", ("min: %lu max: %lu sync dir: %d",
+ (ulong) min, (ulong) max, (int) sync_dir));
+ DBUG_ASSERT(min <= max);
+
+ flush_interval= group_commit_wait;
+ if (flush_interval)
+ flush_start= my_micro_time();
+ for (fn= min; fn <= max; fn++)
+ {
+ TRANSLOG_FILE *file= get_logfile_by_number(fn);
+ DBUG_ASSERT(file != NULL);
+ if (!file->is_sync)
+ {
+ if (my_sync(file->handler.file, MYF(MY_WME)))
+ {
+ rc= 1;
+ translog_stop_writing();
+ DBUG_RETURN(rc);
+ }
+ translog_syncs++;
+ file->is_sync= 1;
+ }
+ }
+
+ if (sync_dir)
+ {
+ if (!(rc= sync_dir(log_descriptor.directory_fd,
+ MYF(MY_WME | MY_IGNORE_BADFD))))
+ translog_syncs++;
+ }
+
+ DBUG_RETURN(rc);
+}
+
+
+/*
+ @brief Flushes buffers with LSNs in them less or equal address <lsn>
+
+ @param lsn address up to which all LSNs should be flushed,
+ can be reset to real last LSN address
+ @parem sent_to_disk returns 'sent to disk' position
+ @param flush_horizon returns horizon of the flush
+
+ @note About terminology see comment to translog_flush().
+*/
+
+void translog_flush_buffers(TRANSLOG_ADDRESS *lsn,
+ TRANSLOG_ADDRESS *sent_to_disk,
+ TRANSLOG_ADDRESS *flush_horizon)
+{
dirty_buffer_mask_t dirty_buffer_mask;
+ uint i;
uint8 last_buffer_no, start_buffer_no;
- my_bool rc= 0;
- DBUG_ENTER("translog_flush");
- DBUG_PRINT("enter", ("Flush up to LSN: (%lu,0x%lx)", LSN_IN_PARTS(lsn)));
- DBUG_ASSERT(translog_status == TRANSLOG_OK ||
- translog_status == TRANSLOG_READONLY);
- LINT_INIT(sent_to_disk);
-
- pthread_mutex_lock(&log_descriptor.log_flush_lock);
- DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
- LSN_IN_PARTS(log_descriptor.flushed)));
- if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
- {
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
- DBUG_RETURN(0);
- }
- if (log_descriptor.flush_in_progress)
- {
- translog_flush_set_new_goal_and_wait(lsn);
- if (!pthread_equal(log_descriptor.max_lsn_requester, pthread_self()))
- {
- /* fix lsn if it was horizon */
- if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->last_lsn) > 0)
- lsn= BUFFER_MAX_LSN(log_descriptor.bc.buffer);
- translog_flush_wait_for_end(lsn);
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
- DBUG_RETURN(0);
- }
- log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
- }
- log_descriptor.flush_in_progress= 1;
- flush_horizon= log_descriptor.previous_flush_horizon;
- DBUG_PRINT("info", ("flush_in_progress is set"));
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
-
- translog_lock();
- if (log_descriptor.is_everything_flushed)
- {
- DBUG_PRINT("info", ("everything is flushed"));
- rc= (translog_status == TRANSLOG_READONLY);
- translog_unlock();
- goto out;
- }
+ DBUG_ENTER("translog_flush_buffers");
/*
We will recheck information when will lock buffers one by
@@ -7656,15 +7809,15 @@
/*
if LSN up to which we have to flush bigger then maximum LSN of previous
buffer and at least one LSN was saved in the current buffer (last_lsn !=
- LSN_IMPOSSIBLE) then we better finish the current buffer.
+ LSN_IMPOSSIBLE) then we have to close the current buffer.
*/
- if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->prev_last_lsn) > 0 &&
+ if (cmp_translog_addr(*lsn, log_descriptor.bc.buffer->prev_last_lsn) > 0 &&
log_descriptor.bc.buffer->last_lsn != LSN_IMPOSSIBLE)
{
struct st_translog_buffer *buffer= log_descriptor.bc.buffer;
- lsn= log_descriptor.bc.buffer->last_lsn; /* fix lsn if it was horizon */
+ *lsn= log_descriptor.bc.buffer->last_lsn; /* fix lsn if it was horizon */
DBUG_PRINT("info", ("LSN to flush fixed to last lsn: (%lu,0x%lx)",
- LSN_IN_PARTS(log_descriptor.bc.buffer->last_lsn)));
+ LSN_IN_PARTS(log_descriptor.bc.buffer->last_lsn)));
last_buffer_no= log_descriptor.bc.buffer_no;
log_descriptor.is_everything_flushed= 1;
translog_force_current_buffer_to_finish();
@@ -7676,8 +7829,10 @@
TRANSLOG_BUFFERS_NO);
translog_unlock();
}
- sent_to_disk= translog_get_sent_to_disk();
- if (cmp_translog_addr(lsn, sent_to_disk) > 0)
+
+ /* flush buffers */
+ *sent_to_disk= translog_get_sent_to_disk();
+ if (cmp_translog_addr(*lsn, *sent_to_disk) > 0)
{
DBUG_PRINT("info", ("Start buffer #: %u last buffer #: %u",
@@ -7697,53 +7852,238 @@
LSN_IN_PARTS(buffer->last_lsn),
(buffer->file ?
"dirty" : "closed")));
- if (buffer->prev_last_lsn <= lsn &&
+ if (buffer->prev_last_lsn <= *lsn &&
buffer->file != NULL)
{
- DBUG_ASSERT(flush_horizon <= buffer->offset + buffer->size);
- flush_horizon= buffer->offset + buffer->size;
+ DBUG_ASSERT(*flush_horizon <= buffer->offset + buffer->size);
+ *flush_horizon= (buffer->pre_force_close_horizon != LSN_IMPOSSIBLE ?
+ buffer->pre_force_close_horizon :
+ buffer->offset + buffer->size);
+ /* pre_force_close_horizon is reset during new buffer start */
+ DBUG_PRINT("info", ("flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(*flush_horizon)));
+ DBUG_ASSERT(*flush_horizon <= log_descriptor.horizon);
+
translog_buffer_flush(buffer);
}
translog_buffer_unlock(buffer);
i= (i + 1) % TRANSLOG_BUFFERS_NO;
} while (i != last_buffer_no);
- sent_to_disk= translog_get_sent_to_disk();
- }
-
- /* sync files from previous flush till current one */
- for (fn= LSN_FILE_NO(log_descriptor.flushed); fn <= LSN_FILE_NO(lsn); fn++)
- {
- TRANSLOG_FILE *file= get_logfile_by_number(fn);
- DBUG_ASSERT(file != NULL);
- if (!file->is_sync)
- {
- if (my_sync(file->handler.file, MYF(MY_WME)))
+ *sent_to_disk= translog_get_sent_to_disk();
+ }
+
+ DBUG_VOID_RETURN;
+}
+
+/**
+ @brief Flush the log up to given LSN (included)
+
+ @param lsn log record serial number up to which (inclusive)
+ the log has to be flushed
+
+ @return Operation status
+ @retval 0 OK
+ @retval 1 Error
+
+ @note
+
+ - Non group commit logic: Commits made in passes. Thread which started
+ flush first is performing actual flush, other threads sets new goal (LSN)
+ of the next pass (if it is maximum) and waits for the pass end or just
+ wait for the pass end.
+
+ - If hard group commit enabled and rate set to zero:
+ The first thread sends all changed buffers to disk. This is repeated
+ as long as there are new LSNs added. The process can not loop
+ forever because we have limited number of threads and they will wait
+ for the data to be synced.
+ Pseudo code:
+
+ do
+ send changed buffers to disk
+ while new_goal
+ sync
+
+ - If hard group commit switched ON and less than rate microseconds has
+ passed from last sync, then after buffers have been sent to disk
+ wait until rate microseconds has passed since last sync, do sync and return.
+ This ensures that if we call sync infrequently we don't do any waits.
+
+ - If soft group commit enabled everything works as with 'non group commit'
+ but the thread doesn't do any real sync(). If rate is not zero the
+ sync() will be performed by a service thread with the given rate
+ when needed (new LSN appears).
+
+ @note Terminology:
+ 'sent to disk' means written to disk but not sync()ed,
+ 'flushed' mean sent to disk and synced().
+*/
+
+my_bool translog_flush(TRANSLOG_ADDRESS lsn)
+{
+ struct timespec abstime;
+ ulonglong flush_interval;
+ ulonglong time_spent;
+ LSN sent_to_disk= LSN_IMPOSSIBLE;
+ TRANSLOG_ADDRESS flush_horizon;
+ my_bool rc= 0;
+ my_bool hgroup_commit_at_start;
+ DBUG_ENTER("translog_flush");
+ DBUG_PRINT("enter", ("Flush up to LSN: (%lu,0x%lx)", LSN_IN_PARTS(lsn)));
+ DBUG_ASSERT(translog_status == TRANSLOG_OK ||
+ translog_status == TRANSLOG_READONLY);
+ LINT_INIT(sent_to_disk);
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.flushed)));
+ if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
+ {
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_RETURN(0);
+ }
+ if (log_descriptor.flush_in_progress)
+ {
+ translog_lock();
+ /* fix lsn if it was horizon */
+ if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->last_lsn) > 0)
+ lsn= BUFFER_MAX_LSN(log_descriptor.bc.buffer);
+ translog_unlock();
+ translog_flush_set_new_goal_and_wait(lsn);
+ if (!pthread_equal(log_descriptor.max_lsn_requester, pthread_self()))
+ {
+ /*
+ translog_flush_wait_for_end() release log_flush_lock while is
+ waiting then acquire it again
+ */
+ translog_flush_wait_for_end(lsn);
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_RETURN(0);
+ }
+ log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
+ }
+ log_descriptor.flush_in_progress= 1;
+ flush_horizon= log_descriptor.previous_flush_horizon;
+ DBUG_PRINT("info", ("flush_in_progress is set, flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(flush_horizon)));
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+
+ hgroup_commit_at_start= hard_group_commit;
+ if (hgroup_commit_at_start)
+ flush_interval= group_commit_wait;
+
+ translog_lock();
+ if (log_descriptor.is_everything_flushed)
+ {
+ DBUG_PRINT("info", ("everything is flushed"));
+ translog_unlock();
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ goto out;
+ }
+
+ for (;;)
+ {
+ /* Following function flushes buffers and makes translog_unlock() */
+ translog_flush_buffers(&lsn, &sent_to_disk, &flush_horizon);
+
+ if (!hgroup_commit_at_start)
+ break; /* flush pass is ended */
+
+retest:
+ /*
+ We do not check time here because pthread_mutex_lock rarely takes
+ a lot of time so we can sacrifice a bit precision to performance
+ (taking into account that my_micro_time() might be expensive call).
+ */
+ if (flush_interval == 0)
+ break; /* flush pass is ended */
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ if (log_descriptor.next_pass_max_lsn == LSN_IMPOSSIBLE)
+ {
+ if (flush_interval == 0 ||
+ (time_spent= (my_micro_time() - flush_start)) >= flush_interval)
{
- rc= 1;
- translog_stop_writing();
- sent_to_disk= LSN_IMPOSSIBLE;
- goto out;
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ break;
}
- file->is_sync= 1;
- }
- }
-
- if (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
- (LSN_FILE_NO(log_descriptor.previous_flush_horizon) !=
- LSN_FILE_NO(flush_horizon) ||
- ((LSN_OFFSET(log_descriptor.previous_flush_horizon) - 1) /
- TRANSLOG_PAGE_SIZE) !=
- ((LSN_OFFSET(flush_horizon) - 1) / TRANSLOG_PAGE_SIZE)))
- rc|= sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD));
+ DBUG_PRINT("info", ("flush waits: %llu interval: %llu spent: %llu",
+ flush_interval - time_spent,
+ flush_interval, time_spent));
+ /* wait time or next goal */
+ set_timespec_nsec(abstime, flush_interval - time_spent);
+ pthread_cond_timedwait(&log_descriptor.new_goal_cond,
+ &log_descriptor.log_flush_lock,
+ &abstime);
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_PRINT("info", ("retest conditions"));
+ goto retest;
+ }
+
+ /* take next goal */
+ lsn= log_descriptor.next_pass_max_lsn;
+ log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
+ /* prevent other thread from continue */
+ log_descriptor.max_lsn_requester= pthread_self();
+ DBUG_PRINT("info", ("flush took next goal: (%lu,0x%lx)",
+ LSN_IN_PARTS(lsn)));
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+
+ /* next flush pass */
+ DBUG_PRINT("info", ("next flush pass"));
+ translog_lock();
+ }
+
+ /*
+ sync() files from previous flush till current one
+ */
+ if (!soft_sync || hgroup_commit_at_start)
+ {
+ if ((rc=
+ translog_sync_files(LSN_FILE_NO(log_descriptor.flushed),
+ LSN_FILE_NO(lsn),
+ sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
+ (LSN_FILE_NO(log_descriptor.
+ previous_flush_horizon) !=
+ LSN_FILE_NO(flush_horizon) ||
+ (LSN_OFFSET(log_descriptor.
+ previous_flush_horizon) /
+ TRANSLOG_PAGE_SIZE) !=
+ (LSN_OFFSET(flush_horizon) /
+ TRANSLOG_PAGE_SIZE)))))
+ {
+ sent_to_disk= LSN_IMPOSSIBLE;
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ goto out;
+ }
+ /* keep values for soft sync() and forced sync() actual */
+ {
+ uint32 fileno= LSN_FILE_NO(lsn);
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ my_atomic_store32(&soft_sync_min, fileno);
+ my_atomic_store32(&soft_sync_max, fileno);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ }
+ }
+ else
+ {
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ my_atomic_store32(&soft_sync_max, LSN_FILE_NO(lsn));
+ my_atomic_store32(&soft_need_sync, 1);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ }
+
+ DBUG_ASSERT(flush_horizon <= log_descriptor.horizon);
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
log_descriptor.previous_flush_horizon= flush_horizon;
out:
- pthread_mutex_lock(&log_descriptor.log_flush_lock);
if (sent_to_disk != LSN_IMPOSSIBLE)
log_descriptor.flushed= sent_to_disk;
log_descriptor.flush_in_progress= 0;
log_descriptor.flush_no++;
DBUG_PRINT("info", ("flush_in_progress is dropped"));
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);\
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
pthread_cond_broadcast(&log_descriptor.log_flush_cond);
DBUG_RETURN(rc);
}
@@ -8113,6 +8453,8 @@
my_bool translog_purge(TRANSLOG_ADDRESS low)
{
uint32 last_need_file= LSN_FILE_NO(low);
+ uint32 min_unsync;
+ int soft;
TRANSLOG_ADDRESS horizon= translog_get_horizon();
int rc= 0;
DBUG_ENTER("translog_purge");
@@ -8120,12 +8462,26 @@
DBUG_ASSERT(translog_status == TRANSLOG_OK ||
translog_status == TRANSLOG_READONLY);
+ soft= soft_sync;
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ min_unsync= my_atomic_load32(&soft_sync_min);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ DBUG_PRINT("info", ("min_unsync: %lu", (ulong) min_unsync));
+ if (soft && min_unsync < last_need_file)
+ {
+ last_need_file= min_unsync;
+ DBUG_PRINT("info", ("last_need_file set to %lu", (ulong)last_need_file));
+ }
+
pthread_mutex_lock(&log_descriptor.purger_lock);
+ DBUG_PRINT("info", ("last_lsn_checked file: %lu:",
+ (ulong) log_descriptor.last_lsn_checked));
if (LSN_FILE_NO(log_descriptor.last_lsn_checked) < last_need_file)
{
uint32 i;
uint32 min_file= translog_first_file(horizon, 1);
DBUG_ASSERT(min_file != 0); /* log is already started */
+ DBUG_PRINT("info", ("min_file: %lu:",(ulong) min_file));
for(i= min_file; i < last_need_file && rc == 0; i++)
{
LSN lsn= translog_get_file_max_lsn_stored(i);
@@ -8356,6 +8712,159 @@
}
+
+/**
+ Sets soft sync mode
+
+ @param mode TRUE if we need switch soft sync on else off
+*/
+
+void translog_soft_sync(my_bool mode)
+{
+ soft_sync= mode;
+}
+
+
+/**
+ Sets hard group commit
+
+ @param mode TRUE if we need switch hard group commit on else off
+*/
+
+void translog_hard_group_commit(my_bool mode)
+{
+ hard_group_commit= mode;
+}
+
+
+/**
+ @brief forced log sync (used when we are switching modes)
+*/
+
+void translog_sync()
+{
+ uint32 max= get_current_logfile()->number;
+ uint32 min;
+ DBUG_ENTER("ma_translog_sync");
+
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+ if (!min)
+ min= max;
+
+ translog_sync_files(min, max, sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS);
+
+ DBUG_VOID_RETURN;
+}
+
+
+/**
+ @brief set rate for group commit
+
+ @param interval interval to set.
+
+ @note We use this function with additional variable because have to
+ restart service thread with new value which we can't make inside changing
+ variable routine (update_maria_group_commit_interval)
+*/
+
+void translog_set_group_commit_interval(uint32 interval)
+{
+ DBUG_ENTER("translog_set_group_commit_interval");
+ group_commit_wait= interval;
+ DBUG_PRINT("info", ("wait: %llu",
+ (ulonglong)group_commit_wait));
+ DBUG_VOID_RETURN;
+}
+
+
+/**
+ @brief syncing service thread
+*/
+
+static pthread_handler_t
+ma_soft_sync_background( void *arg __attribute__((unused)))
+{
+
+ my_thread_init();
+ {
+ DBUG_ENTER("ma_soft_sync_background");
+ for(;;)
+ {
+ ulonglong prev_loop= my_micro_time();
+ ulonglong time, sleep;
+ uint32 min, max, sync_request;
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ max= my_atomic_load32(&soft_sync_max);
+ sync_request= my_atomic_load32(&soft_need_sync);
+ my_atomic_store32(&soft_sync_min, max);
+ my_atomic_store32(&soft_need_sync, 0);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+
+ sleep= group_commit_wait;
+ if (sync_request)
+ translog_sync_files(min, max, FALSE);
+ time= my_micro_time() - prev_loop;
+ if (time > sleep)
+ sleep= 0;
+ else
+ sleep-= time;
+ if (my_service_thread_sleep(&soft_sync_control, sleep))
+ break;
+ }
+ my_service_thread_signal_end(&soft_sync_control);
+ my_thread_end();
+ DBUG_RETURN(0);
+ }
+}
+
+
+/**
+ @brief Starts syncing thread
+*/
+
+int translog_soft_sync_start(void)
+{
+ pthread_t th;
+ int res= 0;
+ uint32 min, max;
+ DBUG_ENTER("translog_soft_sync_start");
+
+ /* check and init variables */
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ max= my_atomic_load32(&soft_sync_max);
+ if (!max)
+ my_atomic_store32(&soft_sync_max, (max= get_current_logfile()->number));
+ if (!min)
+ my_atomic_store32(&soft_sync_min, max);
+ my_atomic_store32(&soft_need_sync, 1);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+
+ if (!(res= ma_service_thread_control_init(&soft_sync_control)))
+ if (!(res= pthread_create(&th, NULL, ma_soft_sync_background, NULL)))
+ soft_sync_control.status= THREAD_RUNNING;
+ DBUG_RETURN(res);
+}
+
+
+/**
+ @brief Stops syncing thread
+*/
+
+void translog_soft_sync_end(void)
+{
+ DBUG_ENTER("translog_soft_sync_end");
+ if (soft_sync_control.inited)
+ {
+ ma_service_thread_control_end(&soft_sync_control);
+ }
+ DBUG_VOID_RETURN;
+}
+
+
#ifdef MARIA_DUMP_LOG
#include <my_getopt.h>
extern void translog_example_table_init();
=== modified file 'storage/maria/ma_loghandler.h'
--- a/storage/maria/ma_loghandler.h 2009-01-15 22:25:53 +0000
+++ b/storage/maria/ma_loghandler.h 2010-02-12 13:12:28 +0000
@@ -342,6 +342,14 @@
TRANSLOG_SHUTDOWN /* going to shutdown the loghandler */
};
extern enum enum_translog_status translog_status;
+extern ulonglong translog_syncs; /* Number of sync()s */
+
+void translog_soft_sync(my_bool mode);
+void translog_hard_group_commit(my_bool mode);
+int translog_soft_sync_start(void);
+void translog_soft_sync_end(void);
+void translog_sync();
+void translog_set_group_commit_interval(uint32 interval);
/*
all the rest added because of recovery; should we make
@@ -441,6 +449,14 @@
typedef enum
{
+ TRANSLOG_GCOMMIT_NONE,
+ TRANSLOG_GCOMMIT_HARD,
+ TRANSLOG_GCOMMIT_SOFT
+} enum_maria_group_commit;
+extern ulong maria_group_commit;
+extern ulong maria_group_commit_interval;
+typedef enum
+{
TRANSLOG_PURGE_IMMIDIATE,
TRANSLOG_PURGE_EXTERNAL,
TRANSLOG_PURGE_ONDEMAND
1
0
[Maria-developers] Rev 2740: Group commit for maria storage engine. in file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
by sanja@askmonty.org 12 Feb '10
by sanja@askmonty.org 12 Feb '10
12 Feb '10
At file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
------------------------------------------------------------
revno: 2740
revision-id: sanja(a)askmonty.org-20100212091325-sluwoeo04cvmjewk
parent: knielsen(a)knielsen-hq.org-20100201190519-b9uktnn90rwwiile
committer: sanja(a)askmonty.org
branch nick: work-maria-5.2-groupcommit
timestamp: Fri 2010-02-12 11:13:25 +0200
message:
Group commit for maria storage engine.
=== added file 'mysql-test/suite/maria/r/group_commit.result'
--- a/mysql-test/suite/maria/r/group_commit.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/maria/r/group_commit.result 2010-02-12 09:13:25 +0000
@@ -0,0 +1,17 @@
+drop table if exists t1;
+create table t1 (a int);
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+drop table t1;
=== modified file 'mysql-test/suite/maria/r/maria3.result'
--- a/mysql-test/suite/maria/r/maria3.result 2009-09-18 01:04:43 +0000
+++ b/mysql-test/suite/maria/r/maria3.result 2010-02-12 09:13:25 +0000
@@ -306,6 +306,8 @@
maria_block_size 8192
maria_checkpoint_interval 30
maria_force_start_after_recovery_failures 0
+maria_group_commit none
+maria_group_commit_interval 0
maria_log_file_size 4294959104
maria_log_purge_type immediate
maria_max_sort_file_size 9223372036853727232
@@ -328,6 +330,7 @@
Maria_pagecache_reads #
Maria_pagecache_write_requests #
Maria_pagecache_writes #
+Maria_transaction_log_syncs #
create table t1 (b char(0));
insert into t1 values(NULL),("");
select length(b) from t1;
=== added file 'mysql-test/suite/maria/t/group_commit.test'
--- a/mysql-test/suite/maria/t/group_commit.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/maria/t/group_commit.test 2010-02-12 09:13:25 +0000
@@ -0,0 +1,71 @@
+# Test different ways of syncing (mostly syntax)
+
+--disable_warnings
+drop table if exists t1;
+--enable_warnings
+
+create table t1 (a int);
+
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+drop table t1;
=== added directory 'randgen'
=== added directory 'randgen/conf'
=== added file 'randgen/conf/maria_group_commit.yy'
--- a/randgen/conf/maria_group_commit.yy 1970-01-01 00:00:00 +0000
+++ b/randgen/conf/maria_group_commit.yy 2010-02-12 09:13:25 +0000
@@ -0,0 +1,181 @@
+# test of group commit switching
+
+query:
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ select | insert | update| delete |
+ change_group_commit | change_interval;
+
+
+select:
+ SELECT select_item FROM join where order_by limit;
+
+select_item:
+ * | X . _field ;
+
+join:
+ _table AS X |
+ _table AS X LEFT JOIN _table AS Y ON ( X . _field = Y . _field ) ;
+
+where:
+ |
+ WHERE X . _field < value |
+ WHERE X . _field > value |
+ WHERE X . _field = value ;
+
+where_delete:
+ |
+ WHERE _field < value |
+ WHERE _field > value |
+ WHERE _field = value ;
+
+order_by:
+ | ORDER BY X . _field ;
+
+limit:
+ | LIMIT _digit ;
+
+insert:
+ INSERT INTO _table ( _field , _field ) VALUES ( value , value ) ;
+
+update:
+ UPDATE _table AS X SET _field = value where order_by limit ;
+
+delete:
+ DELETE FROM _table where_delete LIMIT _digit ;
+
+value:
+ ' _letter ' | _digit | _date | _datetime | _time | _english ;
+
+change_group_commit:
+ SET GLOBAL MARIA_GROUP_COMMIT=none_soft_hard;
+
+none_soft_hard:
+ NONE | SOFT | HARD;
+
+change_interval:
+ set_interval | set_interval | set_interval | set_interval |
+ drop_interval;
+
+set_interval:
+ SET GLOBAL MARIA_GROUP_COMMIT_INTERVAL=_tinyint_unsigned;
+
+drop_interval:
+ SET GLOBAL MARIA_GROUP_COMMIT_INTERVAL=0;
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2009-12-03 11:34:11 +0000
+++ b/storage/maria/ha_maria.cc 2010-02-12 09:13:25 +0000
@@ -102,22 +102,40 @@
array_elements(maria_translog_purge_type_names) - 1, "",
maria_translog_purge_type_names, NULL
};
+
+/* transactional log directory sync */
const char *maria_sync_log_dir_names[]=
{
"NEVER", "NEWFILE", "ALWAYS", NullS
};
-
TYPELIB maria_sync_log_dir_typelib=
{
array_elements(maria_sync_log_dir_names) - 1, "",
maria_sync_log_dir_names, NULL
};
+/* transactional log group commit */
+const char *maria_group_commit_names[]=
+{
+ "none", "hard", "soft", NullS
+};
+TYPELIB maria_group_commit_typelib=
+{
+ array_elements(maria_group_commit_names) - 1, "",
+ maria_group_commit_names, NULL
+};
+
/** Interval between background checkpoints in seconds */
static ulong checkpoint_interval;
static void update_checkpoint_interval(MYSQL_THD thd,
struct st_mysql_sys_var *var,
void *var_ptr, const void *save);
+static void update_maria_group_commit(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save);
+static void update_maria_group_commit_interval(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save);
/** After that many consecutive recovery failures, remove logs */
static ulong force_start_after_recovery_failures;
static void update_log_file_size(MYSQL_THD thd,
@@ -164,6 +182,24 @@
NULL, update_log_file_size, TRANSLOG_FILE_SIZE,
TRANSLOG_MIN_FILE_SIZE, 0xffffffffL, TRANSLOG_PAGE_SIZE);
+static MYSQL_SYSVAR_ENUM(group_commit, maria_group_commit,
+ PLUGIN_VAR_RQCMDARG,
+ "Specifies maria group commit mode. "
+ "Possible values are \"none\" (no group commit), "
+ "\"hard\" (with waiting to actual commit), "
+ "\"soft\" (no wait for commit (DANGEROUS!!!))",
+ NULL, update_maria_group_commit,
+ TRANSLOG_GCOMMIT_NONE, &maria_group_commit_typelib);
+
+static MYSQL_SYSVAR_ULONG(group_commit_interval, maria_group_commit_interval,
+ PLUGIN_VAR_RQCMDARG,
+ "Interval between commite in microseconds (1/1000000c)."
+ " 0 stands for no waiting"
+ " for other threads to come and do a commit in \"hard\" mode and no"
+ " sync()/commit at all in \"soft\" mode. Option has only an effect"
+ " if maria_group_commit is used",
+ NULL, update_maria_group_commit_interval, 0, 0, UINT_MAX, 1);
+
static MYSQL_SYSVAR_ENUM(log_purge_type, log_purge_type,
PLUGIN_VAR_RQCMDARG,
"Specifies how maria transactional log will be purged. "
@@ -3275,6 +3311,8 @@
MYSQL_SYSVAR(block_size),
MYSQL_SYSVAR(checkpoint_interval),
MYSQL_SYSVAR(force_start_after_recovery_failures),
+ MYSQL_SYSVAR(group_commit),
+ MYSQL_SYSVAR(group_commit_interval),
MYSQL_SYSVAR(page_checksum),
MYSQL_SYSVAR(log_dir_path),
MYSQL_SYSVAR(log_file_size),
@@ -3306,6 +3344,92 @@
}
/**
+ @brief Updates group commit mode
+*/
+
+static void update_maria_group_commit(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save)
+{
+ ulong value= (ulong)*((long *)var_ptr);
+ DBUG_ENTER("update_maria_group_commit");
+ DBUG_PRINT("enter", ("old value: %lu new value %lu rate %lu",
+ value, (ulong)(*(long *)save),
+ maria_group_commit_interval));
+ /* old value */
+ switch (value) {
+ case TRANSLOG_GCOMMIT_NONE:
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ translog_hard_group_commit(FALSE);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ translog_soft_sync(FALSE);
+ if (maria_group_commit_interval)
+ translog_soft_sync_end();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ value= *(ulong *)var_ptr= (ulong)(*(long *)save);
+ translog_sync();
+ /* new value */
+ switch (value) {
+ case TRANSLOG_GCOMMIT_NONE:
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ translog_hard_group_commit(TRUE);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ translog_soft_sync(TRUE);
+ /* variable change made under global lock so we can just read it */
+ if (maria_group_commit_interval)
+ translog_soft_sync_start();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ DBUG_VOID_RETURN;
+}
+
+/**
+ @brief Updates group commit interval
+*/
+
+static void update_maria_group_commit_interval(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save)
+{
+ ulong new_value= (ulong)*((long *)save);
+ ulong *value_ptr= (ulong*) var_ptr;
+ DBUG_ENTER("update_maria_group_commit_interval");
+ DBUG_PRINT("enter", ("old value: %lu new value %lu group commit %lu",
+ *value_ptr, new_value, maria_group_commit));
+
+ /* variable change made under global lock so we can just read it */
+ switch (maria_group_commit) {
+ case TRANSLOG_GCOMMIT_NONE:
+ *value_ptr= new_value;
+ translog_set_group_commit_interval(new_value);
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ *value_ptr= new_value;
+ translog_set_group_commit_interval(new_value);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ if (*value_ptr)
+ translog_soft_sync_end();
+ translog_set_group_commit_interval(new_value);
+ if ((*value_ptr= new_value))
+ translog_soft_sync_start();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ DBUG_VOID_RETURN;
+}
+
+/**
@brief Updates the transaction log file limit.
*/
@@ -3327,6 +3451,7 @@
{"Maria_pagecache_reads", (char*) &maria_pagecache_var.global_cache_read, SHOW_LONGLONG},
{"Maria_pagecache_write_requests", (char*) &maria_pagecache_var.global_cache_w_requests, SHOW_LONGLONG},
{"Maria_pagecache_writes", (char*) &maria_pagecache_var.global_cache_write, SHOW_LONGLONG},
+ {"Maria_transaction_log_syncs", (char*) &translog_syncs, SHOW_LONGLONG},
{NullS, NullS, SHOW_LONG}
};
=== modified file 'storage/maria/ma_init.c'
--- a/storage/maria/ma_init.c 2008-10-09 20:03:54 +0000
+++ b/storage/maria/ma_init.c 2010-02-12 09:13:25 +0000
@@ -82,6 +82,11 @@
maria_inited= maria_multi_threaded= FALSE;
ft_free_stopwords();
ma_checkpoint_end();
+ if (translog_status == TRANSLOG_OK)
+ {
+ translog_soft_sync_end();
+ translog_sync();
+ }
if ((trid= trnman_get_max_trid()) > max_trid_in_control_file)
{
/*
=== modified file 'storage/maria/ma_loghandler.c'
--- a/storage/maria/ma_loghandler.c 2010-01-06 21:27:53 +0000
+++ b/storage/maria/ma_loghandler.c 2010-02-12 09:13:25 +0000
@@ -18,6 +18,7 @@
#include "ma_blockrec.h" /* for some constants and in-write hooks */
#include "ma_key_recover.h" /* For some in-write hooks */
#include "ma_checkpoint.h"
+#include "ma_servicethread.h"
/*
On Windows, neither my_open() nor my_sync() work for directories.
@@ -47,6 +48,15 @@
#include <m_ctype.h>
#endif
+/** @brief protects checkpoint_in_progress */
+static pthread_mutex_t LOCK_soft_sync;
+/** @brief for killing the background checkpoint thread */
+static pthread_cond_t COND_soft_sync;
+/** @brief control structure for checkpoint background thread */
+static MA_SERVICE_THREAD_CONTROL soft_sync_control=
+ {THREAD_DEAD, FALSE, &LOCK_soft_sync, &COND_soft_sync};
+
+
/* transaction log file descriptor */
typedef struct st_translog_file
{
@@ -124,10 +134,24 @@
/* Previous buffer offset to detect it flush finish */
TRANSLOG_ADDRESS prev_buffer_offset;
/*
+ If the buffer was forced to close it save value of its horizon
+ otherwise LSN_IMPOSSIBLE
+ */
+ TRANSLOG_ADDRESS pre_force_close_horizon;
+ /*
How much is written (or will be written when copy_to_buffer_in_progress
become 0) to this buffer
*/
translog_size_t size;
+ /*
+ When moving from one log buffer to another, we write the last of the
+ previous buffer to file and then move to start using the new log
+ buffer. In the case of a part filed last page, this page is not moved
+ to the start of the new buffer but instead we set the 'skip_data'
+ variable to tell us how much data at the beginning of the buffer is not
+ relevant.
+ */
+ uint skipped_data;
/* File handler for this buffer */
TRANSLOG_FILE *file;
/* Threads which are waiting for buffer filling/freeing */
@@ -304,6 +328,7 @@
*/
pthread_mutex_t log_flush_lock;
pthread_cond_t log_flush_cond;
+ pthread_cond_t new_goal_cond;
/* Protects changing of headers of finished files (max_lsn) */
pthread_mutex_t file_header_lock;
@@ -344,13 +369,39 @@
ulong log_purge_type= TRANSLOG_PURGE_IMMIDIATE;
ulong log_file_size= TRANSLOG_FILE_SIZE;
+/* sync() of log files directory mode */
ulong sync_log_dir= TRANSLOG_SYNC_DIR_NEWFILE;
+ulong maria_group_commit= TRANSLOG_GCOMMIT_NONE;
+ulong maria_group_commit_interval= 0;
/* Marker for end of log */
static uchar end_of_log= 0;
#define END_OF_LOG &end_of_log
+/**
+ Switch for "soft" sync (no real sync() but periodical sync by service
+ thread)
+*/
+static volatile my_bool soft_sync= FALSE;
+/**
+ Switch for "hard" group commit mode
+*/
+static volatile my_bool hard_group_commit= FALSE;
+/**
+ File numbers interval which have to be sync()
+*/
+static uint32 soft_sync_min= 0;
+static uint32 soft_sync_max= 0;
+static uint32 soft_need_sync= 1;
+/**
+ stores interval in microseconds
+*/
+static uint32 group_commit_wait= 0;
enum enum_translog_status translog_status= TRANSLOG_UNINITED;
+ulonglong translog_syncs= 0; /* Number of sync()s */
+
+/* time of last flush */
+static ulonglong flush_start= 0;
/* chunk types */
#define TRANSLOG_CHUNK_LSN 0x00 /* 0 chunk refer as LSN (head or tail */
@@ -980,12 +1031,17 @@
static TRANSLOG_FILE *get_current_logfile()
{
TRANSLOG_FILE *file;
+ DBUG_ENTER("get_current_logfile");
rw_rdlock(&log_descriptor.open_files_lock);
+ DBUG_PRINT("info", ("max_file: %lu min_file: %lu open_files: %lu",
+ (ulong) log_descriptor.max_file,
+ (ulong) log_descriptor.min_file,
+ (ulong) log_descriptor.open_files.elements));
DBUG_ASSERT(log_descriptor.max_file - log_descriptor.min_file + 1 ==
log_descriptor.open_files.elements);
file= *dynamic_element(&log_descriptor.open_files, 0, TRANSLOG_FILE **);
rw_unlock(&log_descriptor.open_files_lock);
- return (file);
+ DBUG_RETURN(file);
}
uchar NEAR maria_trans_file_magic[]=
@@ -1069,6 +1125,7 @@
static my_bool translog_max_lsn_to_header(File file, LSN lsn)
{
uchar lsn_buff[LSN_STORE_SIZE];
+ my_bool rc;
DBUG_ENTER("translog_max_lsn_to_header");
DBUG_PRINT("enter", ("File descriptor: %ld "
"lsn: (%lu,0x%lx)",
@@ -1077,11 +1134,17 @@
lsn_store(lsn_buff, lsn);
- DBUG_RETURN(my_pwrite(file, lsn_buff,
- LSN_STORE_SIZE,
- (LOG_HEADER_DATA_SIZE - LSN_STORE_SIZE),
- log_write_flags) != 0 ||
- my_sync(file, MYF(MY_WME)) != 0);
+ rc= (my_pwrite(file, lsn_buff,
+ LSN_STORE_SIZE,
+ (LOG_HEADER_DATA_SIZE - LSN_STORE_SIZE),
+ log_write_flags) != 0 ||
+ my_sync(file, MYF(MY_WME)) != 0);
+ /*
+ We should not increase counter in case of error above, but it is so
+ unlikely that we can ignore this case
+ */
+ translog_syncs++;
+ DBUG_RETURN(rc);
}
@@ -1423,7 +1486,9 @@
static my_bool translog_buffer_init(struct st_translog_buffer *buffer, int num)
{
DBUG_ENTER("translog_buffer_init");
- buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
+ buffer->pre_force_close_horizon=
+ buffer->prev_last_lsn= buffer->last_lsn=
+ LSN_IMPOSSIBLE;
DBUG_PRINT("info", ("last_lsn and prev_last_lsn set to 0 buffer: 0x%lx",
(ulong) buffer));
@@ -1435,6 +1500,7 @@
memset(buffer->buffer, TRANSLOG_FILLER, TRANSLOG_WRITE_BUFFER);
/* Buffer size */
buffer->size= 0;
+ buffer->skipped_data= 0;
/* cond of thread which is waiting for buffer filling */
if (pthread_cond_init(&buffer->waiting_filling_buffer, 0))
DBUG_RETURN(1);
@@ -1489,7 +1555,10 @@
TODO: sync only we have changed the log
*/
if (!file->is_sync)
+ {
rc= my_sync(file->handler.file, MYF(MY_WME));
+ translog_syncs++;
+ }
rc|= my_close(file->handler.file, MYF(MY_WME));
my_free(file, MYF(0));
return test(rc);
@@ -2044,7 +2113,8 @@
(ulong) LSN_OFFSET(log_descriptor.horizon),
(ulong) LSN_OFFSET(log_descriptor.horizon)));
DBUG_ASSERT(buffer_no == buffer->buffer_no);
- buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
+ buffer->pre_force_close_horizon=
+ buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
DBUG_PRINT("info", ("last_lsn and prev_last_lsn set to 0 buffer: 0x%lx",
(ulong) buffer));
buffer->offset= log_descriptor.horizon;
@@ -2052,6 +2122,7 @@
buffer->file= get_current_logfile();
buffer->overlay= 0;
buffer->size= 0;
+ buffer->skipped_data= 0;
translog_cursor_init(cursor, buffer, buffer_no);
DBUG_PRINT("info", ("file: #%ld (%d) init cursor #%u: 0x%lx "
"chaser: %d Size: %lu (%lu)",
@@ -2523,6 +2594,7 @@
TRANSLOG_ADDRESS offset= buffer->offset;
TRANSLOG_FILE *file= buffer->file;
uint8 ver= buffer->ver;
+ uint skipped_data;
DBUG_ENTER("translog_buffer_flush");
DBUG_PRINT("enter",
("Buffer: #%u 0x%lx file: %d offset: (%lu,0x%lx) size: %lu",
@@ -2557,6 +2629,8 @@
disk
*/
file= buffer->file;
+ skipped_data= buffer->skipped_data;
+ DBUG_ASSERT(skipped_data < TRANSLOG_PAGE_SIZE);
for (i= 0, pg= LSN_OFFSET(buffer->offset) / TRANSLOG_PAGE_SIZE;
i < buffer->size;
i+= TRANSLOG_PAGE_SIZE, pg++)
@@ -2573,13 +2647,16 @@
DBUG_ASSERT(i + TRANSLOG_PAGE_SIZE <= buffer->size);
if (translog_status != TRANSLOG_OK && translog_status != TRANSLOG_SHUTDOWN)
DBUG_RETURN(1);
- if (pagecache_inject(log_descriptor.pagecache,
+ if (pagecache_write_part(log_descriptor.pagecache,
&file->handler, pg, 3,
buffer->buffer + i,
PAGECACHE_PLAIN_PAGE,
PAGECACHE_LOCK_LEFT_UNLOCKED,
- PAGECACHE_PIN_LEFT_UNPINNED, 0,
- LSN_IMPOSSIBLE))
+ PAGECACHE_PIN_LEFT_UNPINNED,
+ PAGECACHE_WRITE_DONE, 0,
+ LSN_IMPOSSIBLE,
+ skipped_data,
+ TRANSLOG_PAGE_SIZE - skipped_data))
{
DBUG_PRINT("error",
("Can't write page (%lu,0x%lx) to pagecache, error: %d",
@@ -2589,10 +2666,12 @@
translog_stop_writing();
DBUG_RETURN(1);
}
+ skipped_data= 0;
}
file->is_sync= 0;
- if (my_pwrite(file->handler.file, buffer->buffer,
- buffer->size, LSN_OFFSET(buffer->offset),
+ if (my_pwrite(file->handler.file, buffer->buffer + buffer->skipped_data,
+ buffer->size - buffer->skipped_data,
+ LSN_OFFSET(buffer->offset) + buffer->skipped_data,
log_write_flags))
{
DBUG_PRINT("error", ("Can't write buffer (%lu,0x%lx) size %lu "
@@ -2985,6 +3064,7 @@
uchar *from, *table= NULL;
int is_last_unfinished_page;
uint last_protected_sector= 0;
+ uint skipped_data= curr_buffer->skipped_data;
TRANSLOG_FILE file_copy;
uint8 ver= curr_buffer->ver;
translog_wait_for_writers(curr_buffer);
@@ -2997,7 +3077,38 @@
}
DBUG_ASSERT(LSN_FILE_NO(addr) == LSN_FILE_NO(curr_buffer->offset));
from= curr_buffer->buffer + (addr - curr_buffer->offset);
- memcpy(buffer, from, TRANSLOG_PAGE_SIZE);
+ if (skipped_data && addr == curr_buffer->offset)
+ {
+ /*
+ We read page part of which is not present in buffer,
+ so we should read absent part from file (page cache actually)
+ */
+ file= get_logfile_by_number(file_no);
+ DBUG_ASSERT(file != NULL);
+ /*
+ it's ok to not lock the page because:
+ - The log handler has it's own page cache.
+ - There is only one thread that can access the log
+ cache at a time
+ */
+ if (!(buffer= pagecache_read(log_descriptor.pagecache,
+ &file->handler,
+ LSN_OFFSET(addr) / TRANSLOG_PAGE_SIZE,
+ 3, buffer,
+ PAGECACHE_PLAIN_PAGE,
+ PAGECACHE_LOCK_LEFT_UNLOCKED,
+ NULL)))
+ DBUG_RETURN(NULL);
+ }
+ else
+ skipped_data= 0; /* Read after skipped in buffer data */
+ /*
+ Now we have correct data in buffer up to 'skipped_data'. The
+ following memcpy() will move the data from the internal buffer
+ that was not yet on disk.
+ */
+ memcpy(buffer + skipped_data, from + skipped_data,
+ TRANSLOG_PAGE_SIZE - skipped_data);
/*
We can use copy then in translog_page_validator() because it
do not put it permanently somewhere.
@@ -3291,6 +3402,7 @@
uint32 next_page_offset, page_rest;
uint32 i;
File fd;
+ int rc;
TRANSLOG_VALIDATOR_DATA data;
char path[FN_REFLEN];
uchar page_buff[TRANSLOG_PAGE_SIZE];
@@ -3316,14 +3428,19 @@
TRANSLOG_PAGE_SIZE);
page_rest= next_page_offset - LSN_OFFSET(addr);
memset(page_buff, TRANSLOG_FILLER, page_rest);
- if ((fd= open_logfile_by_number_no_cache(LSN_FILE_NO(addr))) < 0 ||
- ((my_chsize(fd, next_page_offset, TRANSLOG_FILLER, MYF(MY_WME)) ||
- (page_rest && my_pwrite(fd, page_buff, page_rest, LSN_OFFSET(addr),
- log_write_flags)) ||
- my_sync(fd, MYF(MY_WME))) |
- my_close(fd, MYF(MY_WME))) ||
- (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
- sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD))))
+ rc= ((fd= open_logfile_by_number_no_cache(LSN_FILE_NO(addr))) < 0 ||
+ ((my_chsize(fd, next_page_offset, TRANSLOG_FILLER, MYF(MY_WME)) ||
+ (page_rest && my_pwrite(fd, page_buff, page_rest, LSN_OFFSET(addr),
+ log_write_flags)) ||
+ my_sync(fd, MYF(MY_WME)))));
+ translog_syncs++;
+ rc|= (fd > 0 && my_close(fd, MYF(MY_WME)));
+ if (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS)
+ {
+ rc|= sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD));
+ translog_syncs++;
+ }
+ if (rc)
DBUG_RETURN(1);
/* fix the horizon */
@@ -3483,7 +3600,10 @@
my_bool version_changed= 0;
DBUG_ENTER("translog_init_with_table");
+ translog_syncs= 0;
+ flush_start= 0;
id_to_share= NULL;
+
log_descriptor.directory_fd= -1;
log_descriptor.is_everything_flushed= 1;
log_descriptor.flush_in_progress= 0;
@@ -3511,6 +3631,7 @@
pthread_mutex_init(&log_descriptor.dirty_buffer_mask_lock,
MY_MUTEX_INIT_FAST) ||
pthread_cond_init(&log_descriptor.log_flush_cond, 0) ||
+ pthread_cond_init(&log_descriptor.new_goal_cond, 0) ||
my_rwlock_init(&log_descriptor.open_files_lock,
NULL) ||
my_init_dynamic_array(&log_descriptor.open_files,
@@ -3912,7 +4033,6 @@
log_descriptor.flushed= log_descriptor.horizon;
log_descriptor.in_buffers_only= log_descriptor.bc.buffer->offset;
log_descriptor.max_lsn= LSN_IMPOSSIBLE; /* set to 0 */
- log_descriptor.previous_flush_horizon= log_descriptor.horizon;
/*
Now 'flushed' is set to 'horizon' value, but 'horizon' is (potentially)
address of the next LSN and we want indicate that all LSNs that are
@@ -3995,6 +4115,10 @@
It is beginning of the log => there is no LSNs in the log =>
There is no harm in leaving it "as-is".
*/
+ log_descriptor.previous_flush_horizon= log_descriptor.horizon;
+ DBUG_PRINT("info", ("previous_flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.
+ previous_flush_horizon)));
DBUG_RETURN(0);
}
file_no--;
@@ -4070,6 +4194,9 @@
translog_free_record_header(&rec);
}
}
+ log_descriptor.previous_flush_horizon= log_descriptor.horizon;
+ DBUG_PRINT("info", ("previous_flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.previous_flush_horizon)));
DBUG_RETURN(0);
err:
ma_message_no_user(0, "log initialization failed");
@@ -4157,6 +4284,7 @@
pthread_mutex_destroy(&log_descriptor.log_flush_lock);
pthread_mutex_destroy(&log_descriptor.dirty_buffer_mask_lock);
pthread_cond_destroy(&log_descriptor.log_flush_cond);
+ pthread_cond_destroy(&log_descriptor.new_goal_cond);
rwlock_destroy(&log_descriptor.open_files_lock);
delete_dynamic(&log_descriptor.open_files);
delete_dynamic(&log_descriptor.unfinished_files);
@@ -6885,11 +7013,11 @@
{
translog_size_t res;
DBUG_ENTER("translog_read_record_header_from_buffer");
- DBUG_ASSERT(translog_is_LSN_chunk(page[page_offset]));
- DBUG_ASSERT(translog_status == TRANSLOG_OK ||
- translog_status == TRANSLOG_READONLY);
DBUG_PRINT("info", ("page byte: 0x%x offset: %u",
(uint) page[page_offset], (uint) page_offset));
+ DBUG_ASSERT(translog_is_LSN_chunk(page[page_offset]));
+ DBUG_ASSERT(translog_status == TRANSLOG_OK ||
+ translog_status == TRANSLOG_READONLY);
buff->type= (page[page_offset] & TRANSLOG_REC_TYPE);
buff->short_trid= uint2korr(page + page_offset + 1);
DBUG_PRINT("info", ("Type %u, Short TrID %u, LSN (%lu,0x%lx)",
@@ -7356,27 +7484,27 @@
"Buffer addr: (%lu,0x%lx) "
"Page addr: (%lu,0x%lx) "
"size: %lu (%lu) Pg: %u left: %u in progress %u",
- (uint) log_descriptor.bc.buffer_no,
- (ulong) log_descriptor.bc.buffer,
- LSN_IN_PARTS(log_descriptor.bc.buffer->offset),
+ (uint) old_buffer_no,
+ (ulong) old_buffer,
+ LSN_IN_PARTS(old_buffer->offset),
(ulong) LSN_FILE_NO(log_descriptor.horizon),
(ulong) (LSN_OFFSET(log_descriptor.horizon) -
log_descriptor.bc.current_page_fill),
- (ulong) log_descriptor.bc.buffer->size,
+ (ulong) old_buffer->size,
(ulong) (log_descriptor.bc.ptr -log_descriptor.bc.
buffer->buffer),
(uint) log_descriptor.bc.current_page_fill,
(uint) left,
- (uint) log_descriptor.bc.buffer->
+ (uint) old_buffer->
copy_to_buffer_in_progress));
translog_lock_assert_owner();
LINT_INIT(current_page_fill);
- new_buff_beginning= log_descriptor.bc.buffer->offset;
- new_buff_beginning+= log_descriptor.bc.buffer->size; /* increase offset */
+ new_buff_beginning= old_buffer->offset;
+ new_buff_beginning+= old_buffer->size; /* increase offset */
DBUG_ASSERT(log_descriptor.bc.ptr !=NULL);
DBUG_ASSERT(LSN_FILE_NO(log_descriptor.horizon) ==
- LSN_FILE_NO(log_descriptor.bc.buffer->offset));
+ LSN_FILE_NO(old_buffer->offset));
translog_check_cursor(&log_descriptor.bc);
DBUG_ASSERT(left < TRANSLOG_PAGE_SIZE);
if (left)
@@ -7387,18 +7515,20 @@
*/
DBUG_PRINT("info", ("left: %u", (uint) left));
+ old_buffer->pre_force_close_horizon=
+ old_buffer->offset + old_buffer->size;
/* decrease offset */
new_buff_beginning-= log_descriptor.bc.current_page_fill;
current_page_fill= log_descriptor.bc.current_page_fill;
memset(log_descriptor.bc.ptr, TRANSLOG_FILLER, left);
- log_descriptor.bc.buffer->size+= left;
+ old_buffer->size+= left;
DBUG_PRINT("info", ("Finish Page buffer #%u: 0x%lx "
"Size: %lu",
- (uint) log_descriptor.bc.buffer->buffer_no,
- (ulong) log_descriptor.bc.buffer,
- (ulong) log_descriptor.bc.buffer->size));
- DBUG_ASSERT(log_descriptor.bc.buffer->buffer_no ==
+ (uint) old_buffer->buffer_no,
+ (ulong) old_buffer,
+ (ulong) old_buffer->size));
+ DBUG_ASSERT(old_buffer->buffer_no ==
log_descriptor.bc.buffer_no);
}
else
@@ -7509,11 +7639,21 @@
if (left)
{
- /*
- TODO: do not copy beginning of the page if we have no CRC or sector
- checks on
- */
- memcpy(new_buffer->buffer, data, current_page_fill);
+ if (log_descriptor.flags &
+ (TRANSLOG_PAGE_CRC | TRANSLOG_SECTOR_PROTECTION))
+ memcpy(new_buffer->buffer, data, current_page_fill);
+ else
+ {
+ /*
+ This page header does not change if we add more data to the page so
+ we can not copy it and will not overwrite later
+ */
+ new_buffer->skipped_data= current_page_fill;
+#ifndef DBUG_OFF
+ memset(new_buffer->buffer, 0xa5, current_page_fill);
+#endif
+ DBUG_ASSERT(new_buffer->skipped_data < TRANSLOG_PAGE_SIZE);
+ }
}
old_buffer->next_buffer_offset= new_buffer->offset;
translog_buffer_lock(new_buffer);
@@ -7561,6 +7701,7 @@
{
log_descriptor.next_pass_max_lsn= lsn;
log_descriptor.max_lsn_requester= pthread_self();
+ pthread_cond_broadcast(&log_descriptor.new_goal_cond);
}
while (flush_no == log_descriptor.flush_no)
{
@@ -7572,66 +7713,78 @@
/**
- @brief Flush the log up to given LSN (included)
-
- @param lsn log record serial number up to which (inclusive)
- the log has to be flushed
-
- @return Operation status
+ @brief sync() range of files (inclusive) and directory (by request)
+
+ @param min min internal file number to flush
+ @param max max internal file number to flush
+ @param sync_dir need sync directory
+
+ return Operation status
@retval 0 OK
@retval 1 Error
-
-*/
-
-my_bool translog_flush(TRANSLOG_ADDRESS lsn)
-{
- LSN sent_to_disk= LSN_IMPOSSIBLE;
- TRANSLOG_ADDRESS flush_horizon;
- uint fn, i;
+*/
+
+static my_bool translog_sync_files(uint32 min, uint32 max,
+ my_bool sync_dir)
+{
+ uint fn;
+ my_bool rc= 0;
+ ulonglong flush_interval;
+ DBUG_ENTER("translog_sync_files");
+ DBUG_PRINT("info", ("min: %lu max: %lu sync dir: %d",
+ (ulong) min, (ulong) max, (int) sync_dir));
+ DBUG_ASSERT(min <= max);
+
+ flush_interval= group_commit_wait;
+ if (flush_interval)
+ flush_start= my_micro_time();
+ for (fn= min; fn <= max; fn++)
+ {
+ TRANSLOG_FILE *file= get_logfile_by_number(fn);
+ DBUG_ASSERT(file != NULL);
+ if (!file->is_sync)
+ {
+ if (my_sync(file->handler.file, MYF(MY_WME)))
+ {
+ rc= 1;
+ translog_stop_writing();
+ DBUG_RETURN(rc);
+ }
+ translog_syncs++;
+ file->is_sync= 1;
+ }
+ }
+
+ if (sync_dir)
+ {
+ if (!(rc= sync_dir(log_descriptor.directory_fd,
+ MYF(MY_WME | MY_IGNORE_BADFD))))
+ translog_syncs++;
+ }
+
+ DBUG_RETURN(rc);
+}
+
+
+/*
+ @brief Flushes buffers with LSNs in them less or equal address <lsn>
+
+ @param lsn address up to which all LSNs should be flushed,
+ can be reset to real last LSN address
+ @parem sent_to_disk returns 'sent to disk' position
+ @param flush_horizon returns horizon of the flush
+
+ @note About terminology see comment to translog_flush().
+*/
+
+void translog_flush_buffers(TRANSLOG_ADDRESS *lsn,
+ TRANSLOG_ADDRESS *sent_to_disk,
+ TRANSLOG_ADDRESS *flush_horizon)
+{
dirty_buffer_mask_t dirty_buffer_mask;
+ uint i;
uint8 last_buffer_no, start_buffer_no;
- my_bool rc= 0;
- DBUG_ENTER("translog_flush");
- DBUG_PRINT("enter", ("Flush up to LSN: (%lu,0x%lx)", LSN_IN_PARTS(lsn)));
- DBUG_ASSERT(translog_status == TRANSLOG_OK ||
- translog_status == TRANSLOG_READONLY);
- LINT_INIT(sent_to_disk);
-
- pthread_mutex_lock(&log_descriptor.log_flush_lock);
- DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
- LSN_IN_PARTS(log_descriptor.flushed)));
- if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
- {
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
- DBUG_RETURN(0);
- }
- if (log_descriptor.flush_in_progress)
- {
- translog_flush_set_new_goal_and_wait(lsn);
- if (!pthread_equal(log_descriptor.max_lsn_requester, pthread_self()))
- {
- /* fix lsn if it was horizon */
- if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->last_lsn) > 0)
- lsn= BUFFER_MAX_LSN(log_descriptor.bc.buffer);
- translog_flush_wait_for_end(lsn);
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
- DBUG_RETURN(0);
- }
- log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
- }
- log_descriptor.flush_in_progress= 1;
- flush_horizon= log_descriptor.previous_flush_horizon;
- DBUG_PRINT("info", ("flush_in_progress is set"));
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
-
- translog_lock();
- if (log_descriptor.is_everything_flushed)
- {
- DBUG_PRINT("info", ("everything is flushed"));
- rc= (translog_status == TRANSLOG_READONLY);
- translog_unlock();
- goto out;
- }
+ DBUG_ENTER("translog_flush_buffers");
/*
We will recheck information when will lock buffers one by
@@ -7656,15 +7809,15 @@
/*
if LSN up to which we have to flush bigger then maximum LSN of previous
buffer and at least one LSN was saved in the current buffer (last_lsn !=
- LSN_IMPOSSIBLE) then we better finish the current buffer.
+ LSN_IMPOSSIBLE) then we have to close the current buffer.
*/
- if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->prev_last_lsn) > 0 &&
+ if (cmp_translog_addr(*lsn, log_descriptor.bc.buffer->prev_last_lsn) > 0 &&
log_descriptor.bc.buffer->last_lsn != LSN_IMPOSSIBLE)
{
struct st_translog_buffer *buffer= log_descriptor.bc.buffer;
- lsn= log_descriptor.bc.buffer->last_lsn; /* fix lsn if it was horizon */
+ *lsn= log_descriptor.bc.buffer->last_lsn; /* fix lsn if it was horizon */
DBUG_PRINT("info", ("LSN to flush fixed to last lsn: (%lu,0x%lx)",
- LSN_IN_PARTS(log_descriptor.bc.buffer->last_lsn)));
+ LSN_IN_PARTS(log_descriptor.bc.buffer->last_lsn)));
last_buffer_no= log_descriptor.bc.buffer_no;
log_descriptor.is_everything_flushed= 1;
translog_force_current_buffer_to_finish();
@@ -7676,8 +7829,10 @@
TRANSLOG_BUFFERS_NO);
translog_unlock();
}
- sent_to_disk= translog_get_sent_to_disk();
- if (cmp_translog_addr(lsn, sent_to_disk) > 0)
+
+ /* flush buffers */
+ *sent_to_disk= translog_get_sent_to_disk();
+ if (cmp_translog_addr(*lsn, *sent_to_disk) > 0)
{
DBUG_PRINT("info", ("Start buffer #: %u last buffer #: %u",
@@ -7697,53 +7852,238 @@
LSN_IN_PARTS(buffer->last_lsn),
(buffer->file ?
"dirty" : "closed")));
- if (buffer->prev_last_lsn <= lsn &&
+ if (buffer->prev_last_lsn <= *lsn &&
buffer->file != NULL)
{
- DBUG_ASSERT(flush_horizon <= buffer->offset + buffer->size);
- flush_horizon= buffer->offset + buffer->size;
+ DBUG_ASSERT(*flush_horizon <= buffer->offset + buffer->size);
+ *flush_horizon= (buffer->pre_force_close_horizon != LSN_IMPOSSIBLE ?
+ buffer->pre_force_close_horizon :
+ buffer->offset + buffer->size);
+ /* pre_force_close_horizon is reset during new buffer start */
+ DBUG_PRINT("info", ("flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(*flush_horizon)));
+ DBUG_ASSERT(*flush_horizon <= log_descriptor.horizon);
+
translog_buffer_flush(buffer);
}
translog_buffer_unlock(buffer);
i= (i + 1) % TRANSLOG_BUFFERS_NO;
} while (i != last_buffer_no);
- sent_to_disk= translog_get_sent_to_disk();
- }
-
- /* sync files from previous flush till current one */
- for (fn= LSN_FILE_NO(log_descriptor.flushed); fn <= LSN_FILE_NO(lsn); fn++)
- {
- TRANSLOG_FILE *file= get_logfile_by_number(fn);
- DBUG_ASSERT(file != NULL);
- if (!file->is_sync)
- {
- if (my_sync(file->handler.file, MYF(MY_WME)))
+ *sent_to_disk= translog_get_sent_to_disk();
+ }
+
+ DBUG_VOID_RETURN;
+}
+
+/**
+ @brief Flush the log up to given LSN (included)
+
+ @param lsn log record serial number up to which (inclusive)
+ the log has to be flushed
+
+ @return Operation status
+ @retval 0 OK
+ @retval 1 Error
+
+ @note
+
+ - Non group commit logic: Commits made in passes. Thread which started
+ flush first is performing actual flush, other threads sets new goal (LSN)
+ of the next pass (if it is maximum) and waits for the pass end or just
+ wait for the pass end.
+
+ - If hard group commit enabled and rate set to zero:
+ The first thread sends all changed buffers to disk. This is repeated
+ as long as there are new LSNs added. The process can not loop
+ forever because we have limited number of threads and they will wait
+ for the data to be synced.
+ Pseudo code:
+
+ do
+ send changed buffers to disk
+ while new_goal
+ sync
+
+ - If hard group commit switched ON and less than rate microseconds has
+ passed from last sync, then after buffers have been sent to disk
+ wait until rate microseconds has passed since last sync, do sync and return.
+ This ensures that if we call sync infrequently we don't do any waits.
+
+ - If soft group commit enabled everything works as with 'non group commit'
+ but the thread doesn't do any real sync(). If rate is not zero the
+ sync() will be performed by a service thread with the given rate
+ when needed (new LSN appears).
+
+ @note Terminology:
+ 'sent to disk' means written to disk but not sync()ed,
+ 'flushed' mean sent to disk and synced().
+*/
+
+my_bool translog_flush(TRANSLOG_ADDRESS lsn)
+{
+ struct timespec abstime;
+ ulonglong flush_interval;
+ ulonglong time_spent;
+ LSN sent_to_disk= LSN_IMPOSSIBLE;
+ TRANSLOG_ADDRESS flush_horizon;
+ my_bool rc= 0;
+ my_bool hgroup_commit_at_start;
+ DBUG_ENTER("translog_flush");
+ DBUG_PRINT("enter", ("Flush up to LSN: (%lu,0x%lx)", LSN_IN_PARTS(lsn)));
+ DBUG_ASSERT(translog_status == TRANSLOG_OK ||
+ translog_status == TRANSLOG_READONLY);
+ LINT_INIT(sent_to_disk);
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.flushed)));
+ if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
+ {
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_RETURN(0);
+ }
+ if (log_descriptor.flush_in_progress)
+ {
+ translog_lock();
+ /* fix lsn if it was horizon */
+ if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->last_lsn) > 0)
+ lsn= BUFFER_MAX_LSN(log_descriptor.bc.buffer);
+ translog_unlock();
+ translog_flush_set_new_goal_and_wait(lsn);
+ if (!pthread_equal(log_descriptor.max_lsn_requester, pthread_self()))
+ {
+ /*
+ translog_flush_wait_for_end() release log_flush_lock while is
+ waiting then acquire it again
+ */
+ translog_flush_wait_for_end(lsn);
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_RETURN(0);
+ }
+ log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
+ }
+ log_descriptor.flush_in_progress= 1;
+ flush_horizon= log_descriptor.previous_flush_horizon;
+ DBUG_PRINT("info", ("flush_in_progress is set, flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(flush_horizon)));
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+
+ hgroup_commit_at_start= hard_group_commit;
+ if (hgroup_commit_at_start)
+ flush_interval= group_commit_wait;
+
+ translog_lock();
+ if (log_descriptor.is_everything_flushed)
+ {
+ DBUG_PRINT("info", ("everything is flushed"));
+ translog_unlock();
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ goto out;
+ }
+
+ for (;;)
+ {
+ /* Following function flushes buffers and makes translog_unlock() */
+ translog_flush_buffers(&lsn, &sent_to_disk, &flush_horizon);
+
+ if (!hgroup_commit_at_start)
+ break; /* flush pass is ended */
+
+retest:
+ /*
+ We do not check time here because pthread_mutex_lock rarely takes
+ a lot of time so we can sacrifice a bit precision to performance
+ (taking into account that my_micro_time() might be expensive call).
+ */
+ if (flush_interval == 0)
+ break; /* flush pass is ended */
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ if (log_descriptor.next_pass_max_lsn == LSN_IMPOSSIBLE)
+ {
+ if (flush_interval == 0 ||
+ (time_spent= (my_micro_time() - flush_start)) >= flush_interval)
{
- rc= 1;
- translog_stop_writing();
- sent_to_disk= LSN_IMPOSSIBLE;
- goto out;
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ break;
}
- file->is_sync= 1;
- }
- }
-
- if (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
- (LSN_FILE_NO(log_descriptor.previous_flush_horizon) !=
- LSN_FILE_NO(flush_horizon) ||
- ((LSN_OFFSET(log_descriptor.previous_flush_horizon) - 1) /
- TRANSLOG_PAGE_SIZE) !=
- ((LSN_OFFSET(flush_horizon) - 1) / TRANSLOG_PAGE_SIZE)))
- rc|= sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD));
+ DBUG_PRINT("info", ("flush waits: %llu interval: %llu spent: %llu",
+ flush_interval - time_spent,
+ flush_interval, time_spent));
+ /* wait time or next goal */
+ set_timespec_nsec(abstime, flush_interval - time_spent);
+ pthread_cond_timedwait(&log_descriptor.new_goal_cond,
+ &log_descriptor.log_flush_lock,
+ &abstime);
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_PRINT("info", ("retest conditions"));
+ goto retest;
+ }
+
+ /* take next goal */
+ lsn= log_descriptor.next_pass_max_lsn;
+ log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
+ /* prevent other thread from continue */
+ log_descriptor.max_lsn_requester= pthread_self();
+ DBUG_PRINT("info", ("flush took next goal: (%lu,0x%lx)",
+ LSN_IN_PARTS(lsn)));
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+
+ /* next flush pass */
+ DBUG_PRINT("info", ("next flush pass"));
+ translog_lock();
+ }
+
+ /*
+ sync() files from previous flush till current one
+ */
+ if (!soft_sync || hgroup_commit_at_start)
+ {
+ if ((rc=
+ translog_sync_files(LSN_FILE_NO(log_descriptor.flushed),
+ LSN_FILE_NO(lsn),
+ sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
+ (LSN_FILE_NO(log_descriptor.
+ previous_flush_horizon) !=
+ LSN_FILE_NO(flush_horizon) ||
+ (LSN_OFFSET(log_descriptor.
+ previous_flush_horizon) /
+ TRANSLOG_PAGE_SIZE) !=
+ (LSN_OFFSET(flush_horizon) /
+ TRANSLOG_PAGE_SIZE)))))
+ {
+ sent_to_disk= LSN_IMPOSSIBLE;
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ goto out;
+ }
+ /* keep values for soft sync() and forced sync() actual */
+ {
+ uint32 fileno= LSN_FILE_NO(lsn);
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ my_atomic_store32(&soft_sync_min, fileno);
+ my_atomic_store32(&soft_sync_max, fileno);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ }
+ }
+ else
+ {
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ my_atomic_store32(&soft_sync_max, LSN_FILE_NO(lsn));
+ my_atomic_store32(&soft_need_sync, 1);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ }
+
+ DBUG_ASSERT(flush_horizon <= log_descriptor.horizon);
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
log_descriptor.previous_flush_horizon= flush_horizon;
out:
- pthread_mutex_lock(&log_descriptor.log_flush_lock);
if (sent_to_disk != LSN_IMPOSSIBLE)
log_descriptor.flushed= sent_to_disk;
log_descriptor.flush_in_progress= 0;
log_descriptor.flush_no++;
DBUG_PRINT("info", ("flush_in_progress is dropped"));
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);\
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
pthread_cond_broadcast(&log_descriptor.log_flush_cond);
DBUG_RETURN(rc);
}
@@ -8113,6 +8453,8 @@
my_bool translog_purge(TRANSLOG_ADDRESS low)
{
uint32 last_need_file= LSN_FILE_NO(low);
+ uint32 min_unsync;
+ int soft;
TRANSLOG_ADDRESS horizon= translog_get_horizon();
int rc= 0;
DBUG_ENTER("translog_purge");
@@ -8120,12 +8462,26 @@
DBUG_ASSERT(translog_status == TRANSLOG_OK ||
translog_status == TRANSLOG_READONLY);
+ soft= soft_sync;
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ min_unsync= my_atomic_load32(&soft_sync_min);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ DBUG_PRINT("info", ("min_unsync: %lu", (ulong) min_unsync));
+ if (soft && min_unsync < last_need_file)
+ {
+ last_need_file= min_unsync;
+ DBUG_PRINT("info", ("last_need_file set to %lu", (ulong)last_need_file));
+ }
+
pthread_mutex_lock(&log_descriptor.purger_lock);
+ DBUG_PRINT("info", ("last_lsn_checked file: %lu:",
+ (ulong) log_descriptor.last_lsn_checked));
if (LSN_FILE_NO(log_descriptor.last_lsn_checked) < last_need_file)
{
uint32 i;
uint32 min_file= translog_first_file(horizon, 1);
DBUG_ASSERT(min_file != 0); /* log is already started */
+ DBUG_PRINT("info", ("min_file: %lu:",(ulong) min_file));
for(i= min_file; i < last_need_file && rc == 0; i++)
{
LSN lsn= translog_get_file_max_lsn_stored(i);
@@ -8356,6 +8712,159 @@
}
+
+/**
+ Sets soft sync mode
+
+ @param mode TRUE if we need switch soft sync on else off
+*/
+
+void translog_soft_sync(my_bool mode)
+{
+ soft_sync= mode;
+}
+
+
+/**
+ Sets hard group commit
+
+ @param mode TRUE if we need switch hard group commit on else off
+*/
+
+void translog_hard_group_commit(my_bool mode)
+{
+ hard_group_commit= mode;
+}
+
+
+/**
+ @brief forced log sync (used when we are switching modes)
+*/
+
+void translog_sync()
+{
+ uint32 max= get_current_logfile()->number;
+ uint32 min;
+ DBUG_ENTER("ma_translog_sync");
+
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+ if (!min)
+ min= max;
+
+ translog_sync_files(min, max, sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS);
+
+ DBUG_VOID_RETURN;
+}
+
+
+/**
+ @brief set rate for group commit
+
+ @param interval interval to set.
+
+ @note We use this function with additional variable because have to
+ restart service thread with new value which we can't make inside changing
+ variable routine (update_maria_group_commit_interval)
+*/
+
+void translog_set_group_commit_interval(uint32 interval)
+{
+ DBUG_ENTER("translog_set_group_commit_interval");
+ group_commit_wait= interval;
+ DBUG_PRINT("info", ("wait: %llu",
+ (ulonglong)group_commit_wait));
+ DBUG_VOID_RETURN;
+}
+
+
+/**
+ @brief syncing service thread
+*/
+
+static pthread_handler_t
+ma_soft_sync_background( void *arg __attribute__((unused)))
+{
+
+ my_thread_init();
+ {
+ DBUG_ENTER("ma_soft_sync_background");
+ for(;;)
+ {
+ ulonglong prev_loop= my_micro_time();
+ ulonglong time, sleep;
+ uint32 min, max, sync_request;
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ max= my_atomic_load32(&soft_sync_max);
+ sync_request= my_atomic_load32(&soft_need_sync);
+ my_atomic_store32(&soft_sync_min, max);
+ my_atomic_store32(&soft_need_sync, 0);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+
+ sleep= group_commit_wait;
+ if (sync_request)
+ translog_sync_files(min, max, FALSE);
+ time= my_micro_time() - prev_loop;
+ if (time > sleep)
+ sleep= 0;
+ else
+ sleep-= time;
+ if (my_service_thread_sleep(&soft_sync_control, sleep))
+ break;
+ }
+ my_service_thread_signal_end(&soft_sync_control);
+ my_thread_end();
+ DBUG_RETURN(0);
+ }
+}
+
+
+/**
+ @brief Starts syncing thread
+*/
+
+int translog_soft_sync_start(void)
+{
+ pthread_t th;
+ int res= 0;
+ uint32 min, max;
+ DBUG_ENTER("translog_soft_sync_start");
+
+ /* check and init variables */
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ max= my_atomic_load32(&soft_sync_max);
+ if (!max)
+ my_atomic_store32(&soft_sync_max, (max= get_current_logfile()->number));
+ if (!min)
+ my_atomic_store32(&soft_sync_min, max);
+ my_atomic_store32(&soft_need_sync, 1);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+
+ if (!(res= ma_service_thread_control_init(&soft_sync_control)))
+ if (!(res= pthread_create(&th, NULL, ma_soft_sync_background, NULL)))
+ soft_sync_control.status= THREAD_RUNNING;
+ DBUG_RETURN(res);
+}
+
+
+/**
+ @brief Stops syncing thread
+*/
+
+void translog_soft_sync_end(void)
+{
+ DBUG_ENTER("translog_soft_sync_end");
+ if (soft_sync_control.inited)
+ {
+ ma_service_thread_control_end(&soft_sync_control);
+ }
+ DBUG_VOID_RETURN;
+}
+
+
#ifdef MARIA_DUMP_LOG
#include <my_getopt.h>
extern void translog_example_table_init();
=== modified file 'storage/maria/ma_loghandler.h'
--- a/storage/maria/ma_loghandler.h 2009-01-15 22:25:53 +0000
+++ b/storage/maria/ma_loghandler.h 2010-02-12 09:13:25 +0000
@@ -342,6 +342,14 @@
TRANSLOG_SHUTDOWN /* going to shutdown the loghandler */
};
extern enum enum_translog_status translog_status;
+extern ulonglong translog_syncs; /* Number of sync()s */
+
+void translog_soft_sync(my_bool mode);
+void translog_hard_group_commit(my_bool mode);
+int translog_soft_sync_start(void);
+void translog_soft_sync_end(void);
+void translog_sync();
+void translog_set_group_commit_interval(uint32 interval);
/*
all the rest added because of recovery; should we make
@@ -441,6 +449,14 @@
typedef enum
{
+ TRANSLOG_GCOMMIT_NONE,
+ TRANSLOG_GCOMMIT_HARD,
+ TRANSLOG_GCOMMIT_SOFT
+} enum_maria_group_commit;
+extern ulong maria_group_commit;
+extern ulong maria_group_commit_interval;
+typedef enum
+{
TRANSLOG_PURGE_IMMIDIATE,
TRANSLOG_PURGE_EXTERNAL,
TRANSLOG_PURGE_ONDEMAND
1
0
[Maria-developers] Rev 2740: Group commit for maria storage engine. in file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
by sanja@askmonty.org 11 Feb '10
by sanja@askmonty.org 11 Feb '10
11 Feb '10
At file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
------------------------------------------------------------
revno: 2740
revision-id: sanja(a)askmonty.org-20100212065247-vnhehxm6snm32c1j
parent: knielsen(a)knielsen-hq.org-20100201190519-b9uktnn90rwwiile
committer: sanja(a)askmonty.org
branch nick: work-maria-5.2-groupcommit
timestamp: Fri 2010-02-12 08:52:47 +0200
message:
Group commit for maria storage engine.
=== added file 'mysql-test/suite/maria/r/group_commit.result'
--- a/mysql-test/suite/maria/r/group_commit.result 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/maria/r/group_commit.result 2010-02-12 06:52:47 +0000
@@ -0,0 +1,17 @@
+drop table if exists t1;
+create table t1 (a int);
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 0;
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 100;
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+drop table t1;
=== modified file 'mysql-test/suite/maria/r/maria3.result'
--- a/mysql-test/suite/maria/r/maria3.result 2009-09-18 01:04:43 +0000
+++ b/mysql-test/suite/maria/r/maria3.result 2010-02-12 06:52:47 +0000
@@ -306,6 +306,8 @@
maria_block_size 8192
maria_checkpoint_interval 30
maria_force_start_after_recovery_failures 0
+maria_group_commit none
+maria_group_commit_interval 0
maria_log_file_size 4294959104
maria_log_purge_type immediate
maria_max_sort_file_size 9223372036853727232
@@ -328,6 +330,7 @@
Maria_pagecache_reads #
Maria_pagecache_write_requests #
Maria_pagecache_writes #
+Maria_transaction_log_syncs #
create table t1 (b char(0));
insert into t1 values(NULL),("");
select length(b) from t1;
=== added file 'mysql-test/suite/maria/t/group_commit.test'
--- a/mysql-test/suite/maria/t/group_commit.test 1970-01-01 00:00:00 +0000
+++ b/mysql-test/suite/maria/t/group_commit.test 2010-02-12 06:52:47 +0000
@@ -0,0 +1,71 @@
+# Test different ways of syncing (mostly syntax)
+
+--disable_warnings
+drop table if exists t1;
+--enable_warnings
+
+create table t1 (a int);
+
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="HARD";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 0;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="SOFT";
+SET GLOBAL maria_group_commit_interval= 100;
+--disable_query_log
+let $num = 5000;
+while ($num)
+{
+ insert into t1 values (1);
+ dec $num;
+}
+--enable_query_log
+SET GLOBAL maria_group_commit="NONE";
+SET GLOBAL maria_group_commit_interval= 0;
+drop table t1;
=== modified file 'storage/maria/ha_maria.cc'
--- a/storage/maria/ha_maria.cc 2009-12-03 11:34:11 +0000
+++ b/storage/maria/ha_maria.cc 2010-02-12 06:52:47 +0000
@@ -102,22 +102,40 @@
array_elements(maria_translog_purge_type_names) - 1, "",
maria_translog_purge_type_names, NULL
};
+
+/* transactional log directory sync */
const char *maria_sync_log_dir_names[]=
{
"NEVER", "NEWFILE", "ALWAYS", NullS
};
-
TYPELIB maria_sync_log_dir_typelib=
{
array_elements(maria_sync_log_dir_names) - 1, "",
maria_sync_log_dir_names, NULL
};
+/* transactional log group commit */
+const char *maria_group_commit_names[]=
+{
+ "none", "hard", "soft", NullS
+};
+TYPELIB maria_group_commit_typelib=
+{
+ array_elements(maria_group_commit_names) - 1, "",
+ maria_group_commit_names, NULL
+};
+
/** Interval between background checkpoints in seconds */
static ulong checkpoint_interval;
static void update_checkpoint_interval(MYSQL_THD thd,
struct st_mysql_sys_var *var,
void *var_ptr, const void *save);
+static void update_maria_group_commit(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save);
+static void update_maria_group_commit_interval(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save);
/** After that many consecutive recovery failures, remove logs */
static ulong force_start_after_recovery_failures;
static void update_log_file_size(MYSQL_THD thd,
@@ -164,6 +182,24 @@
NULL, update_log_file_size, TRANSLOG_FILE_SIZE,
TRANSLOG_MIN_FILE_SIZE, 0xffffffffL, TRANSLOG_PAGE_SIZE);
+static MYSQL_SYSVAR_ENUM(group_commit, maria_group_commit,
+ PLUGIN_VAR_RQCMDARG,
+ "Specifies maria group commit mode. "
+ "Possible values are \"none\" (no group commit), "
+ "\"hard\" (with waiting to actual commit), "
+ "\"soft\" (no wait for commit (DANGEROUS!!!))",
+ NULL, update_maria_group_commit,
+ TRANSLOG_GCOMMIT_NONE, &maria_group_commit_typelib);
+
+static MYSQL_SYSVAR_ULONG(group_commit_interval, maria_group_commit_interval,
+ PLUGIN_VAR_RQCMDARG,
+ "Interval between commite in microseconds (1/1000000c)."
+ " 0 stands for no waiting"
+ " for other threads to come and do a commit in \"hard\" mode and no"
+ " sync()/commit at all in \"soft\" mode. Option has only an effect"
+ " if maria_group_commit is used",
+ NULL, update_maria_group_commit_interval, 0, 0, UINT_MAX, 1);
+
static MYSQL_SYSVAR_ENUM(log_purge_type, log_purge_type,
PLUGIN_VAR_RQCMDARG,
"Specifies how maria transactional log will be purged. "
@@ -3275,6 +3311,8 @@
MYSQL_SYSVAR(block_size),
MYSQL_SYSVAR(checkpoint_interval),
MYSQL_SYSVAR(force_start_after_recovery_failures),
+ MYSQL_SYSVAR(group_commit),
+ MYSQL_SYSVAR(group_commit_interval),
MYSQL_SYSVAR(page_checksum),
MYSQL_SYSVAR(log_dir_path),
MYSQL_SYSVAR(log_file_size),
@@ -3306,6 +3344,92 @@
}
/**
+ @brief Updates group commit mode
+*/
+
+static void update_maria_group_commit(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save)
+{
+ ulong value= (ulong)*((long *)var_ptr);
+ DBUG_ENTER("update_maria_group_commit");
+ DBUG_PRINT("enter", ("old value: %lu new value %lu rate %lu",
+ value, (ulong)(*(long *)save),
+ maria_group_commit_interval));
+ /* old value */
+ switch (value) {
+ case TRANSLOG_GCOMMIT_NONE:
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ translog_hard_group_commit(FALSE);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ translog_soft_sync(FALSE);
+ if (maria_group_commit_interval)
+ translog_soft_sync_end();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ value= *(ulong *)var_ptr= (ulong)(*(long *)save);
+ translog_sync();
+ /* new value */
+ switch (value) {
+ case TRANSLOG_GCOMMIT_NONE:
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ translog_hard_group_commit(TRUE);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ translog_soft_sync(TRUE);
+ /* variable change made under global lock so we can just read it */
+ if (maria_group_commit_interval)
+ translog_soft_sync_start();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ DBUG_VOID_RETURN;
+}
+
+/**
+ @brief Updates group commit interval
+*/
+
+static void update_maria_group_commit_interval(MYSQL_THD thd,
+ struct st_mysql_sys_var *var,
+ void *var_ptr, const void *save)
+{
+ ulong new_value= (ulong)*((long *)save);
+ ulong *value_ptr= (ulong*) var_ptr;
+ DBUG_ENTER("update_maria_group_commit_interval");
+ DBUG_PRINT("enter", ("old value: %lu new value %lu group commit %lu",
+ *value_ptr, new_value, maria_group_commit));
+
+ /* variable change made under global lock so we can just read it */
+ switch (maria_group_commit) {
+ case TRANSLOG_GCOMMIT_NONE:
+ *value_ptr= new_value;
+ translog_set_group_commit_interval(new_value);
+ break;
+ case TRANSLOG_GCOMMIT_HARD:
+ *value_ptr= new_value;
+ translog_set_group_commit_interval(new_value);
+ break;
+ case TRANSLOG_GCOMMIT_SOFT:
+ if (*value_ptr)
+ translog_soft_sync_end();
+ translog_set_group_commit_interval(new_value);
+ if ((*value_ptr= new_value))
+ translog_soft_sync_start();
+ break;
+ default:
+ DBUG_ASSERT(0); /* impossible */
+ }
+ DBUG_VOID_RETURN;
+}
+
+/**
@brief Updates the transaction log file limit.
*/
@@ -3327,6 +3451,7 @@
{"Maria_pagecache_reads", (char*) &maria_pagecache_var.global_cache_read, SHOW_LONGLONG},
{"Maria_pagecache_write_requests", (char*) &maria_pagecache_var.global_cache_w_requests, SHOW_LONGLONG},
{"Maria_pagecache_writes", (char*) &maria_pagecache_var.global_cache_write, SHOW_LONGLONG},
+ {"Maria_transaction_log_syncs", (char*) &translog_syncs, SHOW_LONGLONG},
{NullS, NullS, SHOW_LONG}
};
=== modified file 'storage/maria/ma_init.c'
--- a/storage/maria/ma_init.c 2008-10-09 20:03:54 +0000
+++ b/storage/maria/ma_init.c 2010-02-12 06:52:47 +0000
@@ -82,6 +82,11 @@
maria_inited= maria_multi_threaded= FALSE;
ft_free_stopwords();
ma_checkpoint_end();
+ if (translog_status == TRANSLOG_OK)
+ {
+ translog_soft_sync_end();
+ translog_sync();
+ }
if ((trid= trnman_get_max_trid()) > max_trid_in_control_file)
{
/*
=== modified file 'storage/maria/ma_loghandler.c'
--- a/storage/maria/ma_loghandler.c 2010-01-06 21:27:53 +0000
+++ b/storage/maria/ma_loghandler.c 2010-02-12 06:52:47 +0000
@@ -18,6 +18,7 @@
#include "ma_blockrec.h" /* for some constants and in-write hooks */
#include "ma_key_recover.h" /* For some in-write hooks */
#include "ma_checkpoint.h"
+#include "ma_servicethread.h"
/*
On Windows, neither my_open() nor my_sync() work for directories.
@@ -47,6 +48,15 @@
#include <m_ctype.h>
#endif
+/** @brief protects checkpoint_in_progress */
+static pthread_mutex_t LOCK_soft_sync;
+/** @brief for killing the background checkpoint thread */
+static pthread_cond_t COND_soft_sync;
+/** @brief control structure for checkpoint background thread */
+static MA_SERVICE_THREAD_CONTROL soft_sync_control=
+ {THREAD_DEAD, FALSE, &LOCK_soft_sync, &COND_soft_sync};
+
+
/* transaction log file descriptor */
typedef struct st_translog_file
{
@@ -124,10 +134,24 @@
/* Previous buffer offset to detect it flush finish */
TRANSLOG_ADDRESS prev_buffer_offset;
/*
+ If the buffer was forced to close it save value of its horizon
+ otherwise LSN_IMPOSSIBLE
+ */
+ TRANSLOG_ADDRESS pre_force_close_horizon;
+ /*
How much is written (or will be written when copy_to_buffer_in_progress
become 0) to this buffer
*/
translog_size_t size;
+ /*
+ When moving from one log buffer to another, we write the last of the
+ previous buffer to file and then move to start using the new log
+ buffer. In the case of a part filed last page, this page is not moved
+ to the start of the new buffer but instead we set the 'skip_data'
+ variable to tell us how much data at the beginning of the buffer is not
+ relevant.
+ */
+ uint skipped_data;
/* File handler for this buffer */
TRANSLOG_FILE *file;
/* Threads which are waiting for buffer filling/freeing */
@@ -304,6 +328,7 @@
*/
pthread_mutex_t log_flush_lock;
pthread_cond_t log_flush_cond;
+ pthread_cond_t new_goal_cond;
/* Protects changing of headers of finished files (max_lsn) */
pthread_mutex_t file_header_lock;
@@ -344,13 +369,39 @@
ulong log_purge_type= TRANSLOG_PURGE_IMMIDIATE;
ulong log_file_size= TRANSLOG_FILE_SIZE;
+/* sync() of log files directory mode */
ulong sync_log_dir= TRANSLOG_SYNC_DIR_NEWFILE;
+ulong maria_group_commit= TRANSLOG_GCOMMIT_NONE;
+ulong maria_group_commit_interval= 0;
/* Marker for end of log */
static uchar end_of_log= 0;
#define END_OF_LOG &end_of_log
+/**
+ Switch for "soft" sync (no real sync() but periodical sync by service
+ thread)
+*/
+static volatile my_bool soft_sync= FALSE;
+/**
+ Switch for "hard" group commit mode
+*/
+static volatile my_bool hard_group_commit= FALSE;
+/**
+ File numbers interval which have to be sync()
+*/
+static uint32 soft_sync_min= 0;
+static uint32 soft_sync_max= 0;
+static uint32 soft_need_sync= 1;
+/**
+ stores interval in microseconds
+*/
+static uint32 group_commit_wait= 0;
enum enum_translog_status translog_status= TRANSLOG_UNINITED;
+ulonglong translog_syncs= 0; /* Number of sync()s */
+
+/* time of last flush */
+static ulonglong flush_start= 0;
/* chunk types */
#define TRANSLOG_CHUNK_LSN 0x00 /* 0 chunk refer as LSN (head or tail */
@@ -980,12 +1031,17 @@
static TRANSLOG_FILE *get_current_logfile()
{
TRANSLOG_FILE *file;
+ DBUG_ENTER("get_current_logfile");
rw_rdlock(&log_descriptor.open_files_lock);
+ DBUG_PRINT("info", ("max_file: %lu min_file: %lu open_files: %lu",
+ (ulong) log_descriptor.max_file,
+ (ulong) log_descriptor.min_file,
+ (ulong) log_descriptor.open_files.elements));
DBUG_ASSERT(log_descriptor.max_file - log_descriptor.min_file + 1 ==
log_descriptor.open_files.elements);
file= *dynamic_element(&log_descriptor.open_files, 0, TRANSLOG_FILE **);
rw_unlock(&log_descriptor.open_files_lock);
- return (file);
+ DBUG_RETURN(file);
}
uchar NEAR maria_trans_file_magic[]=
@@ -1069,6 +1125,7 @@
static my_bool translog_max_lsn_to_header(File file, LSN lsn)
{
uchar lsn_buff[LSN_STORE_SIZE];
+ my_bool rc;
DBUG_ENTER("translog_max_lsn_to_header");
DBUG_PRINT("enter", ("File descriptor: %ld "
"lsn: (%lu,0x%lx)",
@@ -1077,11 +1134,17 @@
lsn_store(lsn_buff, lsn);
- DBUG_RETURN(my_pwrite(file, lsn_buff,
- LSN_STORE_SIZE,
- (LOG_HEADER_DATA_SIZE - LSN_STORE_SIZE),
- log_write_flags) != 0 ||
- my_sync(file, MYF(MY_WME)) != 0);
+ rc= (my_pwrite(file, lsn_buff,
+ LSN_STORE_SIZE,
+ (LOG_HEADER_DATA_SIZE - LSN_STORE_SIZE),
+ log_write_flags) != 0 ||
+ my_sync(file, MYF(MY_WME)) != 0);
+ /*
+ We should not increase counter in case of error above, but it is so
+ unlikely that we can ignore this case
+ */
+ translog_syncs++;
+ DBUG_RETURN(rc);
}
@@ -1423,7 +1486,9 @@
static my_bool translog_buffer_init(struct st_translog_buffer *buffer, int num)
{
DBUG_ENTER("translog_buffer_init");
- buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
+ buffer->pre_force_close_horizon=
+ buffer->prev_last_lsn= buffer->last_lsn=
+ LSN_IMPOSSIBLE;
DBUG_PRINT("info", ("last_lsn and prev_last_lsn set to 0 buffer: 0x%lx",
(ulong) buffer));
@@ -1435,6 +1500,7 @@
memset(buffer->buffer, TRANSLOG_FILLER, TRANSLOG_WRITE_BUFFER);
/* Buffer size */
buffer->size= 0;
+ buffer->skipped_data= 0;
/* cond of thread which is waiting for buffer filling */
if (pthread_cond_init(&buffer->waiting_filling_buffer, 0))
DBUG_RETURN(1);
@@ -1489,7 +1555,10 @@
TODO: sync only we have changed the log
*/
if (!file->is_sync)
+ {
rc= my_sync(file->handler.file, MYF(MY_WME));
+ translog_syncs++;
+ }
rc|= my_close(file->handler.file, MYF(MY_WME));
my_free(file, MYF(0));
return test(rc);
@@ -2044,7 +2113,8 @@
(ulong) LSN_OFFSET(log_descriptor.horizon),
(ulong) LSN_OFFSET(log_descriptor.horizon)));
DBUG_ASSERT(buffer_no == buffer->buffer_no);
- buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
+ buffer->pre_force_close_horizon=
+ buffer->prev_last_lsn= buffer->last_lsn= LSN_IMPOSSIBLE;
DBUG_PRINT("info", ("last_lsn and prev_last_lsn set to 0 buffer: 0x%lx",
(ulong) buffer));
buffer->offset= log_descriptor.horizon;
@@ -2052,6 +2122,7 @@
buffer->file= get_current_logfile();
buffer->overlay= 0;
buffer->size= 0;
+ buffer->skipped_data= 0;
translog_cursor_init(cursor, buffer, buffer_no);
DBUG_PRINT("info", ("file: #%ld (%d) init cursor #%u: 0x%lx "
"chaser: %d Size: %lu (%lu)",
@@ -2523,6 +2594,7 @@
TRANSLOG_ADDRESS offset= buffer->offset;
TRANSLOG_FILE *file= buffer->file;
uint8 ver= buffer->ver;
+ uint skipped_data;
DBUG_ENTER("translog_buffer_flush");
DBUG_PRINT("enter",
("Buffer: #%u 0x%lx file: %d offset: (%lu,0x%lx) size: %lu",
@@ -2557,6 +2629,8 @@
disk
*/
file= buffer->file;
+ skipped_data= buffer->skipped_data;
+ DBUG_ASSERT(skipped_data < TRANSLOG_PAGE_SIZE);
for (i= 0, pg= LSN_OFFSET(buffer->offset) / TRANSLOG_PAGE_SIZE;
i < buffer->size;
i+= TRANSLOG_PAGE_SIZE, pg++)
@@ -2573,13 +2647,16 @@
DBUG_ASSERT(i + TRANSLOG_PAGE_SIZE <= buffer->size);
if (translog_status != TRANSLOG_OK && translog_status != TRANSLOG_SHUTDOWN)
DBUG_RETURN(1);
- if (pagecache_inject(log_descriptor.pagecache,
+ if (pagecache_write_part(log_descriptor.pagecache,
&file->handler, pg, 3,
buffer->buffer + i,
PAGECACHE_PLAIN_PAGE,
PAGECACHE_LOCK_LEFT_UNLOCKED,
- PAGECACHE_PIN_LEFT_UNPINNED, 0,
- LSN_IMPOSSIBLE))
+ PAGECACHE_PIN_LEFT_UNPINNED,
+ PAGECACHE_WRITE_DONE, 0,
+ LSN_IMPOSSIBLE,
+ skipped_data,
+ TRANSLOG_PAGE_SIZE - skipped_data))
{
DBUG_PRINT("error",
("Can't write page (%lu,0x%lx) to pagecache, error: %d",
@@ -2589,10 +2666,12 @@
translog_stop_writing();
DBUG_RETURN(1);
}
+ skipped_data= 0;
}
file->is_sync= 0;
- if (my_pwrite(file->handler.file, buffer->buffer,
- buffer->size, LSN_OFFSET(buffer->offset),
+ if (my_pwrite(file->handler.file, buffer->buffer + buffer->skipped_data,
+ buffer->size - buffer->skipped_data,
+ LSN_OFFSET(buffer->offset) + buffer->skipped_data,
log_write_flags))
{
DBUG_PRINT("error", ("Can't write buffer (%lu,0x%lx) size %lu "
@@ -2985,6 +3064,7 @@
uchar *from, *table= NULL;
int is_last_unfinished_page;
uint last_protected_sector= 0;
+ uint skipped_data= curr_buffer->skipped_data;
TRANSLOG_FILE file_copy;
uint8 ver= curr_buffer->ver;
translog_wait_for_writers(curr_buffer);
@@ -2997,7 +3077,38 @@
}
DBUG_ASSERT(LSN_FILE_NO(addr) == LSN_FILE_NO(curr_buffer->offset));
from= curr_buffer->buffer + (addr - curr_buffer->offset);
- memcpy(buffer, from, TRANSLOG_PAGE_SIZE);
+ if (skipped_data && addr == curr_buffer->offset)
+ {
+ /*
+ We read page part of which is not present in buffer,
+ so we should read absent part from file (page cache actually)
+ */
+ file= get_logfile_by_number(file_no);
+ DBUG_ASSERT(file != NULL);
+ /*
+ it's ok to not lock the page because:
+ - The log handler has it's own page cache.
+ - There is only one thread that can access the log
+ cache at a time
+ */
+ if (!(buffer= pagecache_read(log_descriptor.pagecache,
+ &file->handler,
+ LSN_OFFSET(addr) / TRANSLOG_PAGE_SIZE,
+ 3, buffer,
+ PAGECACHE_PLAIN_PAGE,
+ PAGECACHE_LOCK_LEFT_UNLOCKED,
+ NULL)))
+ DBUG_RETURN(NULL);
+ }
+ else
+ skipped_data= 0; /* Read after skipped in buffer data */
+ /*
+ Now we have correct data in buffer up to 'skipped_data'. The
+ following memcpy() will move the data from the internal buffer
+ that was not yet on disk.
+ */
+ memcpy(buffer + skipped_data, from + skipped_data,
+ TRANSLOG_PAGE_SIZE - skipped_data);
/*
We can use copy then in translog_page_validator() because it
do not put it permanently somewhere.
@@ -3291,6 +3402,7 @@
uint32 next_page_offset, page_rest;
uint32 i;
File fd;
+ int rc;
TRANSLOG_VALIDATOR_DATA data;
char path[FN_REFLEN];
uchar page_buff[TRANSLOG_PAGE_SIZE];
@@ -3316,14 +3428,19 @@
TRANSLOG_PAGE_SIZE);
page_rest= next_page_offset - LSN_OFFSET(addr);
memset(page_buff, TRANSLOG_FILLER, page_rest);
- if ((fd= open_logfile_by_number_no_cache(LSN_FILE_NO(addr))) < 0 ||
- ((my_chsize(fd, next_page_offset, TRANSLOG_FILLER, MYF(MY_WME)) ||
- (page_rest && my_pwrite(fd, page_buff, page_rest, LSN_OFFSET(addr),
- log_write_flags)) ||
- my_sync(fd, MYF(MY_WME))) |
- my_close(fd, MYF(MY_WME))) ||
- (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
- sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD))))
+ rc= ((fd= open_logfile_by_number_no_cache(LSN_FILE_NO(addr))) < 0 ||
+ ((my_chsize(fd, next_page_offset, TRANSLOG_FILLER, MYF(MY_WME)) ||
+ (page_rest && my_pwrite(fd, page_buff, page_rest, LSN_OFFSET(addr),
+ log_write_flags)) ||
+ my_sync(fd, MYF(MY_WME)))));
+ translog_syncs++;
+ rc|= (fd > 0 && my_close(fd, MYF(MY_WME)));
+ if (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS)
+ {
+ rc|= sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD));
+ translog_syncs++;
+ }
+ if (rc)
DBUG_RETURN(1);
/* fix the horizon */
@@ -3483,7 +3600,10 @@
my_bool version_changed= 0;
DBUG_ENTER("translog_init_with_table");
+ translog_syncs= 0;
+ flush_start= 0;
id_to_share= NULL;
+
log_descriptor.directory_fd= -1;
log_descriptor.is_everything_flushed= 1;
log_descriptor.flush_in_progress= 0;
@@ -3511,6 +3631,7 @@
pthread_mutex_init(&log_descriptor.dirty_buffer_mask_lock,
MY_MUTEX_INIT_FAST) ||
pthread_cond_init(&log_descriptor.log_flush_cond, 0) ||
+ pthread_cond_init(&log_descriptor.new_goal_cond, 0) ||
my_rwlock_init(&log_descriptor.open_files_lock,
NULL) ||
my_init_dynamic_array(&log_descriptor.open_files,
@@ -3912,7 +4033,6 @@
log_descriptor.flushed= log_descriptor.horizon;
log_descriptor.in_buffers_only= log_descriptor.bc.buffer->offset;
log_descriptor.max_lsn= LSN_IMPOSSIBLE; /* set to 0 */
- log_descriptor.previous_flush_horizon= log_descriptor.horizon;
/*
Now 'flushed' is set to 'horizon' value, but 'horizon' is (potentially)
address of the next LSN and we want indicate that all LSNs that are
@@ -3995,6 +4115,10 @@
It is beginning of the log => there is no LSNs in the log =>
There is no harm in leaving it "as-is".
*/
+ log_descriptor.previous_flush_horizon= log_descriptor.horizon;
+ DBUG_PRINT("info", ("previous_flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.
+ previous_flush_horizon)));
DBUG_RETURN(0);
}
file_no--;
@@ -4070,6 +4194,9 @@
translog_free_record_header(&rec);
}
}
+ log_descriptor.previous_flush_horizon= log_descriptor.horizon;
+ DBUG_PRINT("info", ("previous_flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.previous_flush_horizon)));
DBUG_RETURN(0);
err:
ma_message_no_user(0, "log initialization failed");
@@ -4157,6 +4284,7 @@
pthread_mutex_destroy(&log_descriptor.log_flush_lock);
pthread_mutex_destroy(&log_descriptor.dirty_buffer_mask_lock);
pthread_cond_destroy(&log_descriptor.log_flush_cond);
+ pthread_cond_destroy(&log_descriptor.new_goal_cond);
rwlock_destroy(&log_descriptor.open_files_lock);
delete_dynamic(&log_descriptor.open_files);
delete_dynamic(&log_descriptor.unfinished_files);
@@ -6885,11 +7013,11 @@
{
translog_size_t res;
DBUG_ENTER("translog_read_record_header_from_buffer");
- DBUG_ASSERT(translog_is_LSN_chunk(page[page_offset]));
- DBUG_ASSERT(translog_status == TRANSLOG_OK ||
- translog_status == TRANSLOG_READONLY);
DBUG_PRINT("info", ("page byte: 0x%x offset: %u",
(uint) page[page_offset], (uint) page_offset));
+ DBUG_ASSERT(translog_is_LSN_chunk(page[page_offset]));
+ DBUG_ASSERT(translog_status == TRANSLOG_OK ||
+ translog_status == TRANSLOG_READONLY);
buff->type= (page[page_offset] & TRANSLOG_REC_TYPE);
buff->short_trid= uint2korr(page + page_offset + 1);
DBUG_PRINT("info", ("Type %u, Short TrID %u, LSN (%lu,0x%lx)",
@@ -7356,27 +7484,27 @@
"Buffer addr: (%lu,0x%lx) "
"Page addr: (%lu,0x%lx) "
"size: %lu (%lu) Pg: %u left: %u in progress %u",
- (uint) log_descriptor.bc.buffer_no,
- (ulong) log_descriptor.bc.buffer,
- LSN_IN_PARTS(log_descriptor.bc.buffer->offset),
+ (uint) old_buffer_no,
+ (ulong) old_buffer,
+ LSN_IN_PARTS(old_buffer->offset),
(ulong) LSN_FILE_NO(log_descriptor.horizon),
(ulong) (LSN_OFFSET(log_descriptor.horizon) -
log_descriptor.bc.current_page_fill),
- (ulong) log_descriptor.bc.buffer->size,
+ (ulong) old_buffer->size,
(ulong) (log_descriptor.bc.ptr -log_descriptor.bc.
buffer->buffer),
(uint) log_descriptor.bc.current_page_fill,
(uint) left,
- (uint) log_descriptor.bc.buffer->
+ (uint) old_buffer->
copy_to_buffer_in_progress));
translog_lock_assert_owner();
LINT_INIT(current_page_fill);
- new_buff_beginning= log_descriptor.bc.buffer->offset;
- new_buff_beginning+= log_descriptor.bc.buffer->size; /* increase offset */
+ new_buff_beginning= old_buffer->offset;
+ new_buff_beginning+= old_buffer->size; /* increase offset */
DBUG_ASSERT(log_descriptor.bc.ptr !=NULL);
DBUG_ASSERT(LSN_FILE_NO(log_descriptor.horizon) ==
- LSN_FILE_NO(log_descriptor.bc.buffer->offset));
+ LSN_FILE_NO(old_buffer->offset));
translog_check_cursor(&log_descriptor.bc);
DBUG_ASSERT(left < TRANSLOG_PAGE_SIZE);
if (left)
@@ -7387,18 +7515,20 @@
*/
DBUG_PRINT("info", ("left: %u", (uint) left));
+ old_buffer->pre_force_close_horizon=
+ old_buffer->offset + old_buffer->size;
/* decrease offset */
new_buff_beginning-= log_descriptor.bc.current_page_fill;
current_page_fill= log_descriptor.bc.current_page_fill;
memset(log_descriptor.bc.ptr, TRANSLOG_FILLER, left);
- log_descriptor.bc.buffer->size+= left;
+ old_buffer->size+= left;
DBUG_PRINT("info", ("Finish Page buffer #%u: 0x%lx "
"Size: %lu",
- (uint) log_descriptor.bc.buffer->buffer_no,
- (ulong) log_descriptor.bc.buffer,
- (ulong) log_descriptor.bc.buffer->size));
- DBUG_ASSERT(log_descriptor.bc.buffer->buffer_no ==
+ (uint) old_buffer->buffer_no,
+ (ulong) old_buffer,
+ (ulong) old_buffer->size));
+ DBUG_ASSERT(old_buffer->buffer_no ==
log_descriptor.bc.buffer_no);
}
else
@@ -7509,11 +7639,21 @@
if (left)
{
- /*
- TODO: do not copy beginning of the page if we have no CRC or sector
- checks on
- */
- memcpy(new_buffer->buffer, data, current_page_fill);
+ if (log_descriptor.flags &
+ (TRANSLOG_PAGE_CRC | TRANSLOG_SECTOR_PROTECTION))
+ memcpy(new_buffer->buffer, data, current_page_fill);
+ else
+ {
+ /*
+ This page header does not change if we add more data to the page so
+ we can not copy it and will not overwrite later
+ */
+ new_buffer->skipped_data= current_page_fill;
+#ifndef DBUG_OFF
+ memset(new_buffer->buffer, 0xa5, current_page_fill);
+#endif
+ DBUG_ASSERT(new_buffer->skipped_data < TRANSLOG_PAGE_SIZE);
+ }
}
old_buffer->next_buffer_offset= new_buffer->offset;
translog_buffer_lock(new_buffer);
@@ -7561,6 +7701,7 @@
{
log_descriptor.next_pass_max_lsn= lsn;
log_descriptor.max_lsn_requester= pthread_self();
+ pthread_cond_broadcast(&log_descriptor.new_goal_cond);
}
while (flush_no == log_descriptor.flush_no)
{
@@ -7572,66 +7713,78 @@
/**
- @brief Flush the log up to given LSN (included)
-
- @param lsn log record serial number up to which (inclusive)
- the log has to be flushed
-
- @return Operation status
+ @brief sync() range of files (inclusive) and directory (by request)
+
+ @param min min internal file number to flush
+ @param max max internal file number to flush
+ @param sync_dir need sync directory
+
+ return Operation status
@retval 0 OK
@retval 1 Error
-
-*/
-
-my_bool translog_flush(TRANSLOG_ADDRESS lsn)
-{
- LSN sent_to_disk= LSN_IMPOSSIBLE;
- TRANSLOG_ADDRESS flush_horizon;
- uint fn, i;
+*/
+
+static my_bool translog_sync_files(uint32 min, uint32 max,
+ my_bool sync_dir)
+{
+ uint fn;
+ my_bool rc= 0;
+ ulonglong flush_interval;
+ DBUG_ENTER("translog_sync_files");
+ DBUG_PRINT("info", ("min: %lu max: %lu sync dir: %d",
+ (ulong) min, (ulong) max, (int) sync_dir));
+ DBUG_ASSERT(min <= max);
+
+ flush_interval= group_commit_wait;
+ if (flush_interval)
+ flush_start= my_micro_time();
+ for (fn= min; fn <= max; fn++)
+ {
+ TRANSLOG_FILE *file= get_logfile_by_number(fn);
+ DBUG_ASSERT(file != NULL);
+ if (!file->is_sync)
+ {
+ if (my_sync(file->handler.file, MYF(MY_WME)))
+ {
+ rc= 1;
+ translog_stop_writing();
+ DBUG_RETURN(rc);
+ }
+ translog_syncs++;
+ file->is_sync= 1;
+ }
+ }
+
+ if (sync_dir)
+ {
+ if (!(rc= sync_dir(log_descriptor.directory_fd,
+ MYF(MY_WME | MY_IGNORE_BADFD))))
+ translog_syncs++;
+ }
+
+ DBUG_RETURN(rc);
+}
+
+
+/*
+ @brief Flushes buffers with LSNs in them less or equal address <lsn>
+
+ @param lsn address up to which all LSNs should be flushed,
+ can be reset to real last LSN address
+ @parem sent_to_disk returns 'sent to disk' position
+ @param flush_horizon returns horizon of the flush
+
+ @note About terminology see comment to translog_flush().
+*/
+
+void translog_flush_buffers(TRANSLOG_ADDRESS *lsn,
+ TRANSLOG_ADDRESS *sent_to_disk,
+ TRANSLOG_ADDRESS *flush_horizon)
+{
dirty_buffer_mask_t dirty_buffer_mask;
+ uint i;
uint8 last_buffer_no, start_buffer_no;
- my_bool rc= 0;
- DBUG_ENTER("translog_flush");
- DBUG_PRINT("enter", ("Flush up to LSN: (%lu,0x%lx)", LSN_IN_PARTS(lsn)));
- DBUG_ASSERT(translog_status == TRANSLOG_OK ||
- translog_status == TRANSLOG_READONLY);
- LINT_INIT(sent_to_disk);
-
- pthread_mutex_lock(&log_descriptor.log_flush_lock);
- DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
- LSN_IN_PARTS(log_descriptor.flushed)));
- if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
- {
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
- DBUG_RETURN(0);
- }
- if (log_descriptor.flush_in_progress)
- {
- translog_flush_set_new_goal_and_wait(lsn);
- if (!pthread_equal(log_descriptor.max_lsn_requester, pthread_self()))
- {
- /* fix lsn if it was horizon */
- if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->last_lsn) > 0)
- lsn= BUFFER_MAX_LSN(log_descriptor.bc.buffer);
- translog_flush_wait_for_end(lsn);
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
- DBUG_RETURN(0);
- }
- log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
- }
- log_descriptor.flush_in_progress= 1;
- flush_horizon= log_descriptor.previous_flush_horizon;
- DBUG_PRINT("info", ("flush_in_progress is set"));
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);
-
- translog_lock();
- if (log_descriptor.is_everything_flushed)
- {
- DBUG_PRINT("info", ("everything is flushed"));
- rc= (translog_status == TRANSLOG_READONLY);
- translog_unlock();
- goto out;
- }
+ DBUG_ENTER("translog_flush_buffers");
/*
We will recheck information when will lock buffers one by
@@ -7656,15 +7809,15 @@
/*
if LSN up to which we have to flush bigger then maximum LSN of previous
buffer and at least one LSN was saved in the current buffer (last_lsn !=
- LSN_IMPOSSIBLE) then we better finish the current buffer.
+ LSN_IMPOSSIBLE) then we have to close the current buffer.
*/
- if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->prev_last_lsn) > 0 &&
+ if (cmp_translog_addr(*lsn, log_descriptor.bc.buffer->prev_last_lsn) > 0 &&
log_descriptor.bc.buffer->last_lsn != LSN_IMPOSSIBLE)
{
struct st_translog_buffer *buffer= log_descriptor.bc.buffer;
- lsn= log_descriptor.bc.buffer->last_lsn; /* fix lsn if it was horizon */
+ *lsn= log_descriptor.bc.buffer->last_lsn; /* fix lsn if it was horizon */
DBUG_PRINT("info", ("LSN to flush fixed to last lsn: (%lu,0x%lx)",
- LSN_IN_PARTS(log_descriptor.bc.buffer->last_lsn)));
+ LSN_IN_PARTS(log_descriptor.bc.buffer->last_lsn)));
last_buffer_no= log_descriptor.bc.buffer_no;
log_descriptor.is_everything_flushed= 1;
translog_force_current_buffer_to_finish();
@@ -7676,8 +7829,10 @@
TRANSLOG_BUFFERS_NO);
translog_unlock();
}
- sent_to_disk= translog_get_sent_to_disk();
- if (cmp_translog_addr(lsn, sent_to_disk) > 0)
+
+ /* flush buffers */
+ *sent_to_disk= translog_get_sent_to_disk();
+ if (cmp_translog_addr(*lsn, *sent_to_disk) > 0)
{
DBUG_PRINT("info", ("Start buffer #: %u last buffer #: %u",
@@ -7697,53 +7852,238 @@
LSN_IN_PARTS(buffer->last_lsn),
(buffer->file ?
"dirty" : "closed")));
- if (buffer->prev_last_lsn <= lsn &&
+ if (buffer->prev_last_lsn <= *lsn &&
buffer->file != NULL)
{
- DBUG_ASSERT(flush_horizon <= buffer->offset + buffer->size);
- flush_horizon= buffer->offset + buffer->size;
+ DBUG_ASSERT(*flush_horizon <= buffer->offset + buffer->size);
+ *flush_horizon= (buffer->pre_force_close_horizon != LSN_IMPOSSIBLE ?
+ buffer->pre_force_close_horizon :
+ buffer->offset + buffer->size);
+ /* pre_force_close_horizon is reset during new buffer start */
+ DBUG_PRINT("info", ("flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(*flush_horizon)));
+ DBUG_ASSERT(*flush_horizon <= log_descriptor.horizon);
+
translog_buffer_flush(buffer);
}
translog_buffer_unlock(buffer);
i= (i + 1) % TRANSLOG_BUFFERS_NO;
} while (i != last_buffer_no);
- sent_to_disk= translog_get_sent_to_disk();
- }
-
- /* sync files from previous flush till current one */
- for (fn= LSN_FILE_NO(log_descriptor.flushed); fn <= LSN_FILE_NO(lsn); fn++)
- {
- TRANSLOG_FILE *file= get_logfile_by_number(fn);
- DBUG_ASSERT(file != NULL);
- if (!file->is_sync)
- {
- if (my_sync(file->handler.file, MYF(MY_WME)))
+ *sent_to_disk= translog_get_sent_to_disk();
+ }
+
+ DBUG_VOID_RETURN;
+}
+
+/**
+ @brief Flush the log up to given LSN (included)
+
+ @param lsn log record serial number up to which (inclusive)
+ the log has to be flushed
+
+ @return Operation status
+ @retval 0 OK
+ @retval 1 Error
+
+ @note
+
+ - Non group commit logic: Commits made in passes. Thread which started
+ flush first is performing actual flush, other threads sets new goal (LSN)
+ of the next pass (if it is maximum) and waits for the pass end or just
+ wait for the pass end.
+
+ - If hard group commit enabled and rate set to zero:
+ The first thread sends all changed buffers to disk. This is repeated
+ as long as there are new LSNs added. The process can not loop
+ forever because we have limited number of threads and they will wait
+ for the data to be synced.
+ Pseudo code:
+
+ do
+ send changed buffers to disk
+ while new_goal
+ sync
+
+ - If hard group commit switched ON and less than rate microseconds has
+ passed from last sync, then after buffers have been sent to disk
+ wait until rate microseconds has passed since last sync, do sync and return.
+ This ensures that if we call sync infrequently we don't do any waits.
+
+ - If soft group commit enabled everything works as with 'non group commit'
+ but the thread doesn't do any real sync(). If rate is not zero the
+ sync() will be performed by a service thread with the given rate
+ when needed (new LSN appears).
+
+ @note Terminology:
+ 'sent to disk' means written to disk but not sync()ed,
+ 'flushed' mean sent to disk and synced().
+*/
+
+my_bool translog_flush(TRANSLOG_ADDRESS lsn)
+{
+ struct timespec abstime;
+ ulonglong flush_interval;
+ ulonglong time_spent;
+ LSN sent_to_disk= LSN_IMPOSSIBLE;
+ TRANSLOG_ADDRESS flush_horizon;
+ my_bool rc= 0;
+ my_bool hgroup_commit_at_start;
+ DBUG_ENTER("translog_flush");
+ DBUG_PRINT("enter", ("Flush up to LSN: (%lu,0x%lx)", LSN_IN_PARTS(lsn)));
+ DBUG_ASSERT(translog_status == TRANSLOG_OK ||
+ translog_status == TRANSLOG_READONLY);
+ LINT_INIT(sent_to_disk);
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
+ LSN_IN_PARTS(log_descriptor.flushed)));
+ if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
+ {
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_RETURN(0);
+ }
+ if (log_descriptor.flush_in_progress)
+ {
+ translog_lock();
+ /* fix lsn if it was horizon */
+ if (cmp_translog_addr(lsn, log_descriptor.bc.buffer->last_lsn) > 0)
+ lsn= BUFFER_MAX_LSN(log_descriptor.bc.buffer);
+ translog_unlock();
+ translog_flush_set_new_goal_and_wait(lsn);
+ if (!pthread_equal(log_descriptor.max_lsn_requester, pthread_self()))
+ {
+ /*
+ translog_flush_wait_for_end() release log_flush_lock while is
+ waiting then acquire it again
+ */
+ translog_flush_wait_for_end(lsn);
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_RETURN(0);
+ }
+ log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
+ }
+ log_descriptor.flush_in_progress= 1;
+ flush_horizon= log_descriptor.previous_flush_horizon;
+ DBUG_PRINT("info", ("flush_in_progress is set, flush_horizon: (%lu,0x%lx)",
+ LSN_IN_PARTS(flush_horizon)));
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+
+ hgroup_commit_at_start= hard_group_commit;
+ if (hgroup_commit_at_start)
+ flush_interval= group_commit_wait;
+
+ translog_lock();
+ if (log_descriptor.is_everything_flushed)
+ {
+ DBUG_PRINT("info", ("everything is flushed"));
+ translog_unlock();
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ goto out;
+ }
+
+ for (;;)
+ {
+ /* Following function flushes buffers and makes translog_unlock() */
+ translog_flush_buffers(&lsn, &sent_to_disk, &flush_horizon);
+
+ if (!hgroup_commit_at_start)
+ break; /* flush pass is ended */
+
+retest:
+ /*
+ We do not check time here because pthread_mutex_lock rarely takes
+ a lot of time so we can sacrifice a bit precision to performance
+ (taking into account that my_micro_time() might be expensive call).
+ */
+ if (flush_interval == 0)
+ break; /* flush pass is ended */
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ if (log_descriptor.next_pass_max_lsn == LSN_IMPOSSIBLE)
+ {
+ if (flush_interval == 0 ||
+ (time_spent= (my_micro_time() - flush_start)) >= flush_interval)
{
- rc= 1;
- translog_stop_writing();
- sent_to_disk= LSN_IMPOSSIBLE;
- goto out;
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ break;
}
- file->is_sync= 1;
- }
- }
-
- if (sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
- (LSN_FILE_NO(log_descriptor.previous_flush_horizon) !=
- LSN_FILE_NO(flush_horizon) ||
- ((LSN_OFFSET(log_descriptor.previous_flush_horizon) - 1) /
- TRANSLOG_PAGE_SIZE) !=
- ((LSN_OFFSET(flush_horizon) - 1) / TRANSLOG_PAGE_SIZE)))
- rc|= sync_dir(log_descriptor.directory_fd, MYF(MY_WME | MY_IGNORE_BADFD));
+ DBUG_PRINT("info", ("flush waits: %llu interval: %llu spent: %llu",
+ flush_interval - time_spent,
+ flush_interval, time_spent));
+ /* wait time or next goal */
+ set_timespec_nsec(abstime, flush_interval - time_spent);
+ pthread_cond_timedwait(&log_descriptor.new_goal_cond,
+ &log_descriptor.log_flush_lock,
+ &abstime);
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+ DBUG_PRINT("info", ("retest conditions"));
+ goto retest;
+ }
+
+ /* take next goal */
+ lsn= log_descriptor.next_pass_max_lsn;
+ log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
+ /* prevent other thread from continue */
+ log_descriptor.max_lsn_requester= pthread_self();
+ DBUG_PRINT("info", ("flush took next goal: (%lu,0x%lx)",
+ LSN_IN_PARTS(lsn)));
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
+
+ /* next flush pass */
+ DBUG_PRINT("info", ("next flush pass"));
+ translog_lock();
+ }
+
+ /*
+ sync() files from previous flush till current one
+ */
+ if (!soft_sync || hgroup_commit_at_start)
+ {
+ if ((rc=
+ translog_sync_files(LSN_FILE_NO(log_descriptor.flushed),
+ LSN_FILE_NO(lsn),
+ sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
+ (LSN_FILE_NO(log_descriptor.
+ previous_flush_horizon) !=
+ LSN_FILE_NO(flush_horizon) ||
+ (LSN_OFFSET(log_descriptor.
+ previous_flush_horizon) /
+ TRANSLOG_PAGE_SIZE) !=
+ (LSN_OFFSET(flush_horizon) /
+ TRANSLOG_PAGE_SIZE)))))
+ {
+ sent_to_disk= LSN_IMPOSSIBLE;
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
+ goto out;
+ }
+ /* keep values for soft sync() and forced sync() actual */
+ {
+ uint32 fileno= LSN_FILE_NO(lsn);
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ my_atomic_store32(&soft_sync_min, fileno);
+ my_atomic_store32(&soft_sync_max, fileno);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ }
+ }
+ else
+ {
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ my_atomic_store32(&soft_sync_max, LSN_FILE_NO(lsn));
+ my_atomic_store32(&soft_need_sync, 1);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ }
+
+ DBUG_ASSERT(flush_horizon <= log_descriptor.horizon);
+
+ pthread_mutex_lock(&log_descriptor.log_flush_lock);
log_descriptor.previous_flush_horizon= flush_horizon;
out:
- pthread_mutex_lock(&log_descriptor.log_flush_lock);
if (sent_to_disk != LSN_IMPOSSIBLE)
log_descriptor.flushed= sent_to_disk;
log_descriptor.flush_in_progress= 0;
log_descriptor.flush_no++;
DBUG_PRINT("info", ("flush_in_progress is dropped"));
- pthread_mutex_unlock(&log_descriptor.log_flush_lock);\
+ pthread_mutex_unlock(&log_descriptor.log_flush_lock);
pthread_cond_broadcast(&log_descriptor.log_flush_cond);
DBUG_RETURN(rc);
}
@@ -8113,6 +8453,8 @@
my_bool translog_purge(TRANSLOG_ADDRESS low)
{
uint32 last_need_file= LSN_FILE_NO(low);
+ uint32 min_unsync;
+ int soft;
TRANSLOG_ADDRESS horizon= translog_get_horizon();
int rc= 0;
DBUG_ENTER("translog_purge");
@@ -8120,12 +8462,26 @@
DBUG_ASSERT(translog_status == TRANSLOG_OK ||
translog_status == TRANSLOG_READONLY);
+ soft= soft_sync;
+ my_atomic_rwlock_wrlock(&soft_sync_rwl);
+ min_unsync= my_atomic_load32(&soft_sync_min);
+ my_atomic_rwlock_wrunlock(&soft_sync_rwl);
+ DBUG_PRINT("info", ("min_unsync: %lu", (ulong) min_unsync));
+ if (soft && min_unsync < last_need_file)
+ {
+ last_need_file= min_unsync;
+ DBUG_PRINT("info", ("last_need_file set to %lu", (ulong)last_need_file));
+ }
+
pthread_mutex_lock(&log_descriptor.purger_lock);
+ DBUG_PRINT("info", ("last_lsn_checked file: %lu:",
+ (ulong) log_descriptor.last_lsn_checked));
if (LSN_FILE_NO(log_descriptor.last_lsn_checked) < last_need_file)
{
uint32 i;
uint32 min_file= translog_first_file(horizon, 1);
DBUG_ASSERT(min_file != 0); /* log is already started */
+ DBUG_PRINT("info", ("min_file: %lu:",(ulong) min_file));
for(i= min_file; i < last_need_file && rc == 0; i++)
{
LSN lsn= translog_get_file_max_lsn_stored(i);
@@ -8356,6 +8712,159 @@
}
+
+/**
+ Sets soft sync mode
+
+ @param mode TRUE if we need switch soft sync on else off
+*/
+
+void translog_soft_sync(my_bool mode)
+{
+ soft_sync= mode;
+}
+
+
+/**
+ Sets hard group commit
+
+ @param mode TRUE if we need switch hard group commit on else off
+*/
+
+void translog_hard_group_commit(my_bool mode)
+{
+ hard_group_commit= mode;
+}
+
+
+/**
+ @brief forced log sync (used when we are switching modes)
+*/
+
+void translog_sync()
+{
+ uint32 max= get_current_logfile()->number;
+ uint32 min;
+ DBUG_ENTER("ma_translog_sync");
+
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+ if (!min)
+ min= max;
+
+ translog_sync_files(min, max, sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS);
+
+ DBUG_VOID_RETURN;
+}
+
+
+/**
+ @brief set rate for group commit
+
+ @param interval interval to set.
+
+ @note We use this function with additional variable because have to
+ restart service thread with new value which we can't make inside changing
+ variable routine (update_maria_group_commit_interval)
+*/
+
+void translog_set_group_commit_interval(uint32 interval)
+{
+ DBUG_ENTER("translog_set_group_commit_interval");
+ group_commit_wait= interval;
+ DBUG_PRINT("info", ("wait: %llu",
+ (ulonglong)group_commit_wait));
+ DBUG_VOID_RETURN;
+}
+
+
+/**
+ @brief syncing service thread
+*/
+
+static pthread_handler_t
+ma_soft_sync_background( void *arg __attribute__((unused)))
+{
+
+ my_thread_init();
+ {
+ DBUG_ENTER("ma_soft_sync_background");
+ for(;;)
+ {
+ ulonglong prev_loop= my_micro_time();
+ ulonglong time, sleep;
+ uint32 min, max, sync_request;
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ max= my_atomic_load32(&soft_sync_max);
+ sync_request= my_atomic_load32(&soft_need_sync);
+ my_atomic_store32(&soft_sync_min, max);
+ my_atomic_store32(&soft_need_sync, 0);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+
+ sleep= group_commit_wait;
+ if (sync_request)
+ translog_sync_files(min, max, FALSE);
+ time= my_micro_time() - prev_loop;
+ if (time > sleep)
+ sleep= 0;
+ else
+ sleep-= time;
+ if (my_service_thread_sleep(&soft_sync_control, sleep))
+ break;
+ }
+ my_service_thread_signal_end(&soft_sync_control);
+ my_thread_end();
+ DBUG_RETURN(0);
+ }
+}
+
+
+/**
+ @brief Starts syncing thread
+*/
+
+int translog_soft_sync_start(void)
+{
+ pthread_t th;
+ int res= 0;
+ uint32 min, max;
+ DBUG_ENTER("translog_soft_sync_start");
+
+ /* check and init variables */
+ my_atomic_rwlock_rdlock(&soft_sync_rwl);
+ min= my_atomic_load32(&soft_sync_min);
+ max= my_atomic_load32(&soft_sync_max);
+ if (!max)
+ my_atomic_store32(&soft_sync_max, (max= get_current_logfile()->number));
+ if (!min)
+ my_atomic_store32(&soft_sync_min, max);
+ my_atomic_store32(&soft_need_sync, 1);
+ my_atomic_rwlock_rdunlock(&soft_sync_rwl);
+
+ if (!(res= ma_service_thread_control_init(&soft_sync_control)))
+ if (!(res= pthread_create(&th, NULL, ma_soft_sync_background, NULL)))
+ soft_sync_control.status= THREAD_RUNNING;
+ DBUG_RETURN(res);
+}
+
+
+/**
+ @brief Stops syncing thread
+*/
+
+void translog_soft_sync_end(void)
+{
+ DBUG_ENTER("translog_soft_sync_end");
+ if (soft_sync_control.inited)
+ {
+ ma_service_thread_control_end(&soft_sync_control);
+ }
+ DBUG_VOID_RETURN;
+}
+
+
#ifdef MARIA_DUMP_LOG
#include <my_getopt.h>
extern void translog_example_table_init();
=== modified file 'storage/maria/ma_loghandler.h'
--- a/storage/maria/ma_loghandler.h 2009-01-15 22:25:53 +0000
+++ b/storage/maria/ma_loghandler.h 2010-02-12 06:52:47 +0000
@@ -342,6 +342,14 @@
TRANSLOG_SHUTDOWN /* going to shutdown the loghandler */
};
extern enum enum_translog_status translog_status;
+extern ulonglong translog_syncs; /* Number of sync()s */
+
+void translog_soft_sync(my_bool mode);
+void translog_hard_group_commit(my_bool mode);
+int translog_soft_sync_start(void);
+void translog_soft_sync_end(void);
+void translog_sync();
+void translog_set_group_commit_interval(uint32 interval);
/*
all the rest added because of recovery; should we make
@@ -441,6 +449,14 @@
typedef enum
{
+ TRANSLOG_GCOMMIT_NONE,
+ TRANSLOG_GCOMMIT_HARD,
+ TRANSLOG_GCOMMIT_SOFT
+} enum_maria_group_commit;
+extern ulong maria_group_commit;
+extern ulong maria_group_commit_interval;
+typedef enum
+{
TRANSLOG_PURGE_IMMIDIATE,
TRANSLOG_PURGE_EXTERNAL,
TRANSLOG_PURGE_ONDEMAND
1
0
[Maria-developers] Rev 2756: BUG#31480: Incorrect result for nested subquery when executed via semi join in file:///home/psergey/dev/maria-5.3-subqueries-r6/
by Sergey Petrunya 11 Feb '10
by Sergey Petrunya 11 Feb '10
11 Feb '10
At file:///home/psergey/dev/maria-5.3-subqueries-r6/
------------------------------------------------------------
revno: 2756
revision-id: psergey(a)askmonty.org-20100211235958-p11o4e80dlrn2bsq
parent: psergey(a)askmonty.org-20100211223118-5fzuidow1pkubpzl
committer: Sergey Petrunya <psergey(a)askmonty.org>
branch nick: maria-5.3-subqueries-r6
timestamp: Fri 2010-02-12 02:59:58 +0300
message:
BUG#31480: Incorrect result for nested subquery when executed via semi join
- Variant #3 of the fix. It also
= Unifies code with table elimination's
= is able to handle FROM-subquery pullout.
=== modified file 'mysql-test/r/subselect_sj.result'
--- a/mysql-test/r/subselect_sj.result 2010-01-17 14:51:10 +0000
+++ b/mysql-test/r/subselect_sj.result 2010-02-11 23:59:58 +0000
@@ -779,3 +779,48 @@
1 PRIMARY it2 ALL NULL NULL NULL NULL 20 Using where; End temporary
DROP TABLE ot1, it1, it2;
# End of BUG#38075
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
=== modified file 'mysql-test/r/subselect_sj_jcl6.result'
--- a/mysql-test/r/subselect_sj_jcl6.result 2010-01-17 14:51:10 +0000
+++ b/mysql-test/r/subselect_sj_jcl6.result 2010-02-11 23:59:58 +0000
@@ -783,6 +783,51 @@
1 PRIMARY it2 ALL NULL NULL NULL NULL 20 Using where; End temporary; Using join buffer
DROP TABLE ot1, it1, it2;
# End of BUG#38075
+#
+# BUG#31480: Incorrect result for nested subquery when executed via semi join
+#
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+id select_type table type possible_keys key key_len ref rows filtered Extra
+1 PRIMARY t2 ALL NULL NULL NULL NULL 6 100.00 Start temporary
+1 PRIMARY t1 ALL NULL NULL NULL NULL 7 100.00 Using where; End temporary; Using join buffer
+3 DEPENDENT SUBQUERY t3 ALL NULL NULL NULL NULL 4 100.00 Using where
+Warnings:
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+show warnings;
+Level Code Message
+Note 1276 Field or reference 'test.t1.b' of SELECT #3 was resolved in SELECT #1
+Note 1003 select `test`.`t1`.`a` AS `a` from `test`.`t1` semi join (`test`.`t2`) where ((`test`.`t1`.`a` = `test`.`t2`.`c`) and <nop>(<in_optimizer>(`test`.`t2`.`d`,<exists>(select 1 AS `Not_used` from `test`.`t3` where ((`test`.`t1`.`b` = `test`.`t3`.`e`) and (<cache>(`test`.`t2`.`d`) >= `test`.`t3`.`e`))))))
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+a
+2
+2
+3
+2
+drop table t1, t2, t3;
set join_cache_level=default;
show variables like 'join_cache_level';
Variable_name Value
=== modified file 'mysql-test/t/subselect_sj.test'
--- a/mysql-test/t/subselect_sj.test 2010-01-17 14:51:10 +0000
+++ b/mysql-test/t/subselect_sj.test 2010-02-11 23:59:58 +0000
@@ -681,3 +681,41 @@
DROP TABLE ot1, it1, it2;
--echo # End of BUG#38075
+
+--echo #
+--echo # BUG#31480: Incorrect result for nested subquery when executed via semi join
+--echo #
+create table t1 (a int not null, b int not null);
+create table t2 (c int not null, d int not null);
+create table t3 (e int not null);
+
+insert into t1 values (1,10);
+insert into t1 values (2,10);
+insert into t1 values (1,20);
+insert into t1 values (2,20);
+insert into t1 values (3,20);
+insert into t1 values (2,30);
+insert into t1 values (4,40);
+
+insert into t2 values (2,10);
+insert into t2 values (2,20);
+insert into t2 values (4,10);
+insert into t2 values (5,10);
+insert into t2 values (3,20);
+insert into t2 values (2,40);
+
+insert into t3 values (10);
+insert into t3 values (30);
+insert into t3 values (10);
+insert into t3 values (20);
+
+explain extended
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+show warnings;
+
+select a from t1
+where a in (select c from t2 where d >= some(select e from t3 where b=e));
+
+drop table t1, t2, t3;
+
=== modified file 'sql/item.cc'
--- a/sql/item.cc 2010-02-11 22:00:36 +0000
+++ b/sql/item.cc 2010-02-11 23:59:58 +0000
@@ -3647,7 +3647,7 @@
substitution)
*/
-static void mark_as_dependent(THD *thd, SELECT_LEX *last, SELECT_LEX *current,
+static bool mark_as_dependent(THD *thd, SELECT_LEX *last, SELECT_LEX *current,
Item_ident *resolved_item,
Item_ident *mark_item)
{
@@ -3658,7 +3658,9 @@
/* store pointer on SELECT_LEX from which item is dependent */
if (mark_item)
mark_item->depended_from= last;
- current->mark_as_dependent(last, resolved_item);
+ if (current->mark_as_dependent(thd, last, /** resolved_item psergey-thu
+ **/mark_item))
+ return TRUE;
if (thd->lex->describe & DESCRIBE_EXTENDED)
{
push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_NOTE,
@@ -3668,6 +3670,7 @@
resolved_item->field_name,
current->select_number, last->select_number);
}
+ return FALSE;
}
@@ -4119,6 +4122,7 @@
((ref_type == REF_ITEM || ref_type == FIELD_ITEM) ?
(Item_ident*) (*reference) :
0));
+
/*
A reference to a view field had been found and we
substituted it instead of this Item (find_field_in_tables
@@ -4218,7 +4222,7 @@
return -1;
mark_as_dependent(thd, last_checked_context->select_lex,
- context->select_lex, this,
+ context->select_lex, rf,
rf);
return 0;
}
@@ -5998,7 +6002,7 @@
goto error;
thd->change_item_tree(reference, fld);
mark_as_dependent(thd, last_checked_context->select_lex,
- thd->lex->current_select, this, fld);
+ thd->lex->current_select, fld, fld);
/*
A reference is resolved to a nest level that's outer or the same as
the nest level of the enclosing set function : adjust the value of
@@ -6438,7 +6442,7 @@
if (depended_from == new_parent)
{
*ref= outer_ref;
- outer_ref->fix_after_pullout(new_parent, ref);
+ (*ref)->fix_after_pullout(new_parent, ref);
}
}
=== modified file 'sql/item.h'
--- a/sql/item.h 2010-02-11 22:00:36 +0000
+++ b/sql/item.h 2010-02-11 23:59:58 +0000
@@ -1115,7 +1115,9 @@
/*
- Class to be used to enumerate all field references in an item tree.
+ Class to be used to enumerate all field references in an item tree. This
+ includes references to outside but not fields of the tables within a
+ subquery.
Suggested usage:
class My_enumerator : public Field_enumerator
@@ -2377,6 +2379,8 @@
}
bool walk(Item_processor processor, bool walk_subquery, uchar *arg)
{ return (*ref)->walk(processor, walk_subquery, arg); }
+ bool enumerate_field_refs_processor(uchar *arg)
+ { return (*ref)->enumerate_field_refs_processor(arg); }
virtual void print(String *str, enum_query_type query_type);
bool result_as_longlong()
{
=== modified file 'sql/item_subselect.cc'
--- a/sql/item_subselect.cc 2010-01-28 13:48:33 +0000
+++ b/sql/item_subselect.cc 2010-02-11 23:59:58 +0000
@@ -39,8 +39,8 @@
Item_subselect::Item_subselect():
Item_result_field(), value_assigned(0), thd(0), substitution(0),
engine(0), old_engine(0), used_tables_cache(0), have_to_be_excluded(0),
- const_item_cache(1), in_fix_fields(0), engine_changed(0), changed(0),
- is_correlated(FALSE)
+ const_item_cache(1), inside_first_fix_fields(0), done_first_fix_fields(FALSE),
+ engine_changed(0), changed(0), is_correlated(FALSE)
{
with_subselect= 1;
reset();
@@ -167,18 +167,23 @@
DBUG_ASSERT(fixed == 0);
engine->set_thd((thd= thd_param));
- if (!in_fix_fields)
- refers_to.empty();
+ if (!done_first_fix_fields)
+ {
+ done_first_fix_fields= TRUE;
+ inside_first_fix_fields= TRUE;
+ }
+
eliminated= FALSE;
+ parent_select= thd_param->lex->current_select;
if (check_stack_overrun(thd, STACK_MIN_SIZE, (uchar*)&res))
return TRUE;
- in_fix_fields++;
res= engine->prepare();
// all transformation is done (used by prepared statements)
changed= 1;
+ inside_first_fix_fields= FALSE;
if (!res)
{
@@ -210,14 +215,12 @@
if (!(*ref)->fixed)
ret= (*ref)->fix_fields(thd, ref);
thd->where= save_where;
- in_fix_fields--;
return ret;
}
// Is it one field subselect?
if (engine->cols() > max_columns)
{
my_error(ER_OPERAND_COLUMNS, MYF(0), 1);
- in_fix_fields--;
return TRUE;
}
fix_length_and_dec();
@@ -234,7 +237,6 @@
fixed= 1;
err:
- in_fix_fields--;
thd->where= save_where;
return res;
}
@@ -242,11 +244,12 @@
bool Item_subselect::enumerate_field_refs_processor(uchar *arg)
{
- List_iterator<Item> it(refers_to);
- Item *item;
- while ((item= it++))
+ List_iterator<Ref_to_outside> it(upper_refs);
+ Ref_to_outside *upper;
+
+ while ((upper= it++))
{
- if (item->walk(&Item::enumerate_field_refs_processor, FALSE, arg))
+ if (upper->item->walk(&Item::enumerate_field_refs_processor, FALSE, arg))
return TRUE;
}
return FALSE;
@@ -258,6 +261,142 @@
return FALSE;
}
+
+bool Item_subselect::mark_as_dependent(THD *thd, st_select_lex *select,
+ Item *item)
+{
+ if (inside_first_fix_fields)
+ {
+ is_correlated= TRUE;
+ Ref_to_outside *upper;
+ if (!(upper= new (thd->stmt_arena->mem_root) Ref_to_outside()))
+ return TRUE;
+ upper->select= select;
+ upper->item= item;
+ if (upper_refs.push_back(upper, thd->stmt_arena->mem_root))
+ return TRUE;
+ }
+ return FALSE;
+}
+
+/*
+ Adjust attributes after our parent select has been merged into grandparent
+
+ DESCRIPTION
+ Subquery is a composite object which may be correlated, that is, it may
+ have
+ 1. references to tables of the parent select (i.e. one that has the clause
+ with the subquery predicate)
+ 2. references to tables of the grandparent select
+ 3. references to tables of further ancestors.
+
+ Before the pullout, this item indicates:
+ - #1 with table bits in used_tables()
+ - #2 and #3 with OUTER_REF_TABLE_BIT.
+
+ After parent has been merged with grandparent:
+ - references to parent and grandparent tables should be indicated with
+ table bits.
+ - references to greatgrandparent and further ancestors - with
+ OUTER_REF_TABLE_BIT.
+*/
+
+void Item_subselect::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+{
+ recalc_used_tables(new_parent, TRUE);
+ parent_select= new_parent;
+}
+
+class Field_fixer: public Field_enumerator
+{
+public:
+ table_map used_tables; /* Collect used_tables here */
+ st_select_lex *new_parent; /* Select we're in */
+ virtual void visit_field(Field *field)
+ {
+ //for (TABLE_LIST *tbl= new_parent->leaf_tables; tbl; tbl= tbl->next_local)
+ //{
+ // if (tbl->table == field->table)
+ // {
+ used_tables|= field->table->map;
+ // return;
+ // }
+ //}
+ //used_tables |= OUTER_REF_TABLE_BIT;
+ }
+};
+
+
+/*
+ Recalculate used_tables_cache
+*/
+
+void Item_subselect::recalc_used_tables(st_select_lex *new_parent,
+ bool after_pullout)
+{
+ List_iterator<Ref_to_outside> it(upper_refs);
+ Ref_to_outside *upper;
+
+ used_tables_cache= 0;
+ while ((upper= it++))
+ {
+ bool found= FALSE;
+ /*
+ Check if
+ 1. the upper reference refers to the new immediate parent select, or
+ 2. one of the further ancestors.
+
+ We rely on the fact that the tree of selects is modified by some kind of
+ 'flattening', i.e. a process where child selects are merged into their
+ parents.
+ The merged selects are removed from the select tree but keep pointers to
+ their parents.
+ */
+ for (st_select_lex *sel= upper->select; sel; sel= sel->outer_select())
+ {
+ /*
+ If we've reached the new parent select by walking upwards from
+ reference's original select, this means that the reference is now
+ referring to the direct parent:
+ */
+ if (sel == new_parent)
+ {
+ found= TRUE;
+ /*
+ upper->item may be NULL when we've referred to a grouping function,
+ in which case we don't care about what it's table_map really is,
+ because item->with_sum_func==1 will ensure correct placement of the
+ item.
+ */
+ if (upper->item)
+ {
+ // Now, iterate over fields and collect used_tables() attribute:
+ Field_fixer fixer;
+ fixer.used_tables= 0;
+ fixer.new_parent= new_parent;
+ upper->item->walk(&Item::enumerate_field_refs_processor, FALSE,
+ (uchar*)&fixer);
+ used_tables_cache |= fixer.used_tables;
+ /*
+ if (after_pullout)
+ upper->item->fix_after_pullout(new_parent, &(upper->item));
+ upper->item->update_used_tables();
+ used_tables_cache |= upper->item->used_tables();
+ */
+ }
+ }
+ }
+ if (!found)
+ used_tables_cache|= OUTER_REF_TABLE_BIT;
+ }
+ /*
+ Don't update const_tables_cache yet as we don't yet know which of the
+ parent's tables are constant. Parent will call update_used_tables() after
+ he has done const table detection, and that will be our chance to update
+ const_tables_cache.
+ */
+}
+
bool Item_subselect::walk(Item_processor processor, bool walk_subquery,
uchar *argument)
{
@@ -397,6 +536,7 @@
void Item_subselect::update_used_tables()
{
+ recalc_used_tables(parent_select, FALSE);
if (!engine->uncacheable())
{
// did all used tables become static?
@@ -1843,6 +1983,18 @@
return result || Item_subselect::fix_fields(thd_arg, ref);
}
+void Item_in_subselect::fix_after_pullout(st_select_lex *new_parent, Item **ref)
+{
+ left_expr->fix_after_pullout(new_parent, &left_expr);
+ Item_subselect::fix_after_pullout(new_parent, ref);
+}
+
+void Item_in_subselect::update_used_tables()
+{
+ Item_subselect::update_used_tables();
+ left_expr->update_used_tables();
+ used_tables_cache |= left_expr->used_tables();
+}
/**
Try to create an engine to compute the subselect via materialization,
=== modified file 'sql/item_subselect.h'
--- a/sql/item_subselect.h 2010-01-28 13:48:33 +0000
+++ b/sql/item_subselect.h 2010-02-11 23:59:58 +0000
@@ -67,14 +67,32 @@
bool have_to_be_excluded;
/* cache of constant state */
bool const_item_cache;
-
+
+ bool inside_first_fix_fields;
+ bool done_first_fix_fields;
public:
- /*
- References from inside the subquery to the select that this predicate is
- in. References to parent selects not included.
+ /* A reference from inside subquery predicate to somewhere outside of it */
+ class Ref_to_outside : public Sql_alloc
+ {
+ public:
+ st_select_lex *select; /* Select where the reference is pointing to */
+ /*
+ What is being referred. This may be NULL when we're referring to an
+ aggregate function.
+ */
+ Item *item;
+ };
+ /*
+ References from within this subquery to somewhere outside of it (i.e. to
+ parent select, grandparent select, etc)
*/
- List<Item> refers_to;
- int in_fix_fields;
+ List<Ref_to_outside> upper_refs;
+ st_select_lex *parent_select;
+
+ /*
+ TRUE<=>Table Elimination has made it redundant to evaluate this select
+ (and so it is not part of QEP, etc)
+ */
bool eliminated;
/* changed engine indicator */
@@ -117,6 +135,9 @@
return null_value;
}
bool fix_fields(THD *thd, Item **ref);
+ bool mark_as_dependent(THD *thd, st_select_lex *select, Item *item);
+ void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void recalc_used_tables(st_select_lex *new_parent, bool after_pullout);
virtual bool exec();
virtual void fix_length_and_dec();
table_map used_tables() const;
@@ -396,6 +417,8 @@
bool test_limit(st_select_lex_unit *unit);
virtual void print(String *str, enum_query_type query_type);
bool fix_fields(THD *thd, Item **ref);
+ void fix_after_pullout(st_select_lex *new_parent, Item **ref);
+ void update_used_tables();
bool setup_engine();
bool init_left_expr_cache();
bool is_expensive_processor(uchar *arg);
=== modified file 'sql/item_sum.cc'
--- a/sql/item_sum.cc 2009-10-15 21:38:29 +0000
+++ b/sql/item_sum.cc 2010-02-11 23:59:58 +0000
@@ -350,7 +350,7 @@
sl= sl->master_unit()->outer_select() )
sl->master_unit()->item->with_sum_func= 1;
}
- thd->lex->current_select->mark_as_dependent(aggr_sel, NULL);
+ thd->lex->current_select->mark_as_dependent(thd, aggr_sel, NULL);
return FALSE;
}
=== modified file 'sql/sql_lex.cc'
--- a/sql/sql_lex.cc 2010-01-28 13:48:33 +0000
+++ b/sql/sql_lex.cc 2010-02-11 23:59:58 +0000
@@ -1841,9 +1841,8 @@
'last' should be reachable from this st_select_lex_node
*/
-void st_select_lex::mark_as_dependent(st_select_lex *last, Item *dependency)
+bool st_select_lex::mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency)
{
- SELECT_LEX *next_to_last;
/*
Mark all selects from resolved to 1 before select where was
found table as depended (of select where was found table)
@@ -1867,12 +1866,15 @@
sl->uncacheable|= UNCACHEABLE_UNITED;
}
}
- next_to_last= s;
+
+ Item_subselect *subquery_expr= s->master_unit()->item;
+ if (subquery_expr && subquery_expr->mark_as_dependent(thd, last,
+ dependency))
+ return TRUE;
}
is_correlated= TRUE;
this->master_unit()->item->is_correlated= TRUE;
- if (dependency)
- next_to_last->master_unit()->item->refers_to.push_back(dependency);
+ return FALSE;
}
bool st_select_lex_node::set_braces(bool value) { return 1; }
=== modified file 'sql/sql_lex.h'
--- a/sql/sql_lex.h 2010-01-28 13:48:33 +0000
+++ b/sql/sql_lex.h 2010-02-11 23:59:58 +0000
@@ -747,7 +747,7 @@
return master_unit()->return_after_parsing();
}
- void mark_as_dependent(st_select_lex *last, Item *dependency);
+ bool mark_as_dependent(THD *thd, st_select_lex *last, Item *dependency);
bool set_braces(bool value);
bool inc_in_sum_expr();
1
0