[Maria-developers] Implementing new "group commit" API in PBXT?
Hi Paul!

I want to ask your opinion about implementing in PBXT an extension to the storage engine API that I am working on. There are lots of details in http://askmonty.org/worklog/Server-BackLog/?tid=116 (and even more details in other places), but I thought you would appreciate the short version :-)

The idea is to get a well-defined ordering of commits in the server in an efficient way (e.g. not breaking group commit like InnoDB currently does, Bug#13669). For this, two new (optional) storage engine methods are introduced:

    void (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
    void (*commit_ordered)(handlerton *hton, THD *thd, bool all);

The prepare_ordered() method is called after prepare(), as soon as the commit order is decided. The commit_ordered() method is called before commit(), just after the transaction coordinator has made the final decision that the transaction will be durably committed (and not rolled back).

The calls into commit_ordered() among different transactions happen in the order that these transactions are committed, consistently across all engines and the binary log. The same holds for prepare_ordered().

The idea is that the storage engine should do the minimal amount of work in commit_ordered() necessary to make the commit visible to other threads, and to make sure commits appear to be done in the order of the calls to these methods.

Do you think either (or both) of these methods could be implemented in PBXT with reasonable effort (and if so, how)?

----

In InnoDB, this was trivial to do, as the InnoDB commit() method already had a "fast" part (which fixed the transaction log order (= "commit order") and made the transaction visible) and a "slow" part (which did the fsync() to make the transaction durable and handled group commit).

(It is necessary that commit_ordered() be fast, as it runs under a global lock. Ideally, it will just allocate an LSN in the transaction log to fix the commit order, plus perhaps whatever else already needs to happen serialised during engine commit.)

I hope my explanation was sufficiently clear for you to give a qualified answer. Maybe you can point me to where in the PBXT code a commit becomes visible and the commit order is fixed?

In case you were wondering, here are some of the motivations for this feature:

1. For taking a hot backup, it is useful to have a consistent commit order between the binlog and the storage engines. Without it, it can happen that the backed-up state of the server has transaction A (but not B) committed in the storage engine, and transaction B (but not A) written to the binlog. Using such a backup to provision a new master or slave would leave replication in an inconsistent state.

2. This feature implements working group commit for the binlog while still preserving consistent order as per (1).

3. It will make it possible to implement START TRANSACTION WITH CONSISTENT SNAPSHOT for multi-engine transactions that is truly consistent (currently it is possible for a transaction to be visible in one engine but not another in such a "consistent" snapshot).

4. Galera relies on a consistent commit order, and I believe this feature will allow it to get this in a more engine-independent way.

5. We are planning to use consistent commit order to allow MySQL to recover, after a crash, transactions that were synced to disk in the binlog but not in the engine. This will make it possible to reduce the number of fsync() calls during prepare() / commit() from 3 to 1; it only needs to be done in the binlog (with group commit); the engine does not need to fsync(), as any lost transactions will be recovered from the binlog after a crash.

6. The prepare_ordered() method is inspired by the Facebook patch to release InnoDB row-level read locks early (before syncing the binlog to disk) to improve performance in the presence of hot spots (probably does not apply to PBXT).

 - Kristian.
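The intended division of labour can be sketched in a few lines of C++. This is purely illustrative (MiniEngine and all its members are invented names, not server or PBXT code): commit_ordered() runs serialised under a lock and only fixes the commit order, here modelled as allocating an LSN, while commit() does the durability work later.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical sketch of the contract described above (not real server code).
struct MiniEngine {
    std::mutex order_mutex;                  // plays the role of the server's global lock
    unsigned long next_lsn = 1;
    std::vector<unsigned long> commit_order; // LSNs in the order they were fixed

    // "Fast" part: fix commit order and make the transaction visible.
    unsigned long commit_ordered() {
        std::lock_guard<std::mutex> guard(order_mutex);
        unsigned long lsn = next_lsn++;
        commit_order.push_back(lsn);
        return lsn;
    }

    // "Slow" part: make the transaction durable (fsync etc.), possibly
    // batched across many transactions (group commit). No ordering work here.
    void commit(unsigned long /*lsn*/) {
        // flush_log_up_to(lsn);  // placeholder for the durability work
    }
};
```

Because the ordering is fixed entirely in the fast part, the slow parts may run in any order (or be batched) without affecting what other threads observe.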
Hi Kristian,

The easiest way to do this would be to add a parameter to xn_end_xact() that indicates that the log should not be written or flushed.

In xn_end_xact(), the last parameter to the call to xt_xlog_log_data() determines what should happen:

    #define XT_XLOG_NO_WRITE_NO_FLUSH   0
    #define XT_XLOG_WRITE_AND_FLUSH     1
    #define XT_XLOG_WRITE_AND_NO_FLUSH  2

Without write or flush, this is a very fast operation. But the transaction is still committed and ordered, it is just not durable.

Then we have to make a note on the thread to flush the log when the actual commit is called. But this need not be a general flush: the thread only needs to flush the log past the point at which the commit record was written. That position is returned by xlog_append(), which was called by xt_xlog_log_data() above. At the moment, this return value is ignored. In the case of commit_ordered, this value must be stored; we then need to add the size of the COMMIT record to the offset.

Then, when the actual commit is called, we check the current log flush position against the flush position we need. If it is past our position, then this is a NOP. If not, then we need to call xlog_append() with no data. This will do a group commit on the log.

It was a bit difficult to explain, so please ask if anything is not clear.

Best regards,

Paul

On Sep 29, 2010, at 11:45 AM, Kristian Nielsen wrote:
I want to ask your opinion about implementing in PBXT an extension to the storage engine API that I am working on.
There are lots of details in http://askmonty.org/worklog/Server-BackLog/?tid=116 (and even more details in other places), but I thought you would appreciate the short version :-)
The idea is to get a well-defined ordering of commits in the server in an efficient way (e.g. not breaking group commit like InnoDB currently does, Bug#13669).
For this, two new (optional) storage engine methods are introduced:
void (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
The prepare_ordered() method is called after prepare(), as soon as commit order is decided. The commit_ordered() method is called before commit(), just after the transaction coordinator has made the final decision that the transaction will be durably committed (and not rolled back).
The calls into commit_ordered() among different transactions will happen in the order that these transactions are committed, consistently across all engines and the binary log. Same for prepare_ordered().
The idea is that the storage engine should do the minimal amount of work in commit_ordered() necessary to make the commit visible to other threads, and to make sure commits appear to be done in the order of the calls to these methods.
Do you think either (or both) of these methods could be implemented in PBXT with reasonable effort (and if so, how)?
----
In InnoDB, this was trivial to do, as the InnoDB commit() method already had a "fast" part (which fixed the transaction log order (= "commit order") and made the transaction visible) and a "slow" part (which did the fsync() to make the transaction durable, and handled group commit).
(It is necessary that commit_ordered() is fast, as it runs under a global lock. Ideally, it will just allocate an LSN in the transaction log to fix commit order, and perhaps whatever else already needs to happen serialised during engine commit).
I hope my explanation was sufficiently clear for you to make a qualified answer. Maybe you can point me to where in the PBXT code a commit becomes visible and the commit order is fixed?
In case you were wondering, here are some of the motivations for this feature:
1. For taking a hot backup, it is useful to have consistent commit order between binlog and storage engines. Without it, it can happen that the backed up state of the server has transaction A (but not B) committed in the storage engine, and transaction B (but not A) written to the binlog. Using such backup to provision a new master or slave would leave replication in an inconsistent state.
2. This feature implements working group commit for the binlog while still preserving consistent order as per (1).
3. It will make it possible to implement START TRANSACTION WITH CONSISTENT SNAPSHOT for multi-engine transactions that is truly consistent (currently it is possible for a transaction to be visible in one engine but not another in such a "consistent" snapshot).
4. Galera relies on a consistent commit order, and I believe this feature will allow it to get this in a more engine-independent way.
5. We are planning to use consistent commit order to allow MySQL to recover, after a crash, transactions that were synced to disk in the binlog but not in the engine. This will make it possible to reduce the number of fsync() calls during prepare() / commit() from 3 to 1; it only needs to be done in the binlog (with group commit); the engine does not need to fsync(), as any lost transactions will be recovered from the binlog after a crash.
6. The prepare_ordered() method is inspired by the Facebook patch to release InnoDB row-level read locks early (before syncing the binlog to disk) to improve performance in the presence of hot spots (probably does not apply to PBXT).
- Kristian.
--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com
Paul McCullagh <paul.mccullagh@primebase.org> writes:
Hi Kristian,
Hi Paul, thanks for your detailed answer!
It was a bit difficult to explain, so please ask if anything is not clear.
It seems pretty clear from your explanation. I will take a look into the sources, and will let you know if I have any questions.
Without write or flush, this is a very fast operation. But the transaction is still committed and ordered, it is just not durable.
Cool. Note that when using 2-phase commit (which happens if the binlog is enabled or if multiple engines participate in the transaction), the transaction is effectively durable at the server level (the TC will recover it in case of a crash), though at the engine level it is not. And in fact the transaction is already potentially visible to slaves.

[This got me thinking about the case where there is no 2-phase commit (no binlog, and no other participating engines). In this case the transaction is _not_ durable at the server level until after commit in the engine. So I do not like enforcing visibility in commit_ordered() in this case. But when there is no 2-phase commit, there is no benefit in commit_ordered() anyway. So I think I will have the server only use commit_ordered() for 2-phase commit, similarly to prepare(). Slightly more book-keeping for the engines, but saner semantics in the end.

Anyway, this is not really PBXT specific, but your mail inspired me to think about it, so thanks!]

 - Kristian.
Paul McCullagh <paul.mccullagh@primebase.org> writes:
The easiest way to do this would be to add a parameter to xn_end_xact() that indicates that the log should not be written or flushed.
Ok, I gave it a shot, but I had some problems due to not knowing the PBXT code sufficiently ...
In xn_end_xact(), the last parameter to the call to xt_xlog_log_data() determines what should happen:
#define XT_XLOG_NO_WRITE_NO_FLUSH   0
#define XT_XLOG_WRITE_AND_FLUSH     1
#define XT_XLOG_WRITE_AND_NO_FLUSH  2
Without write or flush, this is a very fast operation. But the transaction is still committed and ordered, it is just not durable.
I notice that xn_end_xact() does a number of things. I am wondering if all of these should be in the "fast" part in commit_ordered(), or if some should be done in the "slow" part along with the log flush?

In particular this, flushing the data log (is this flush to disk?):

    if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
        ok = FALSE;
        status = XT_LOG_ENT_ABORT;
    }

and this, at the end, concerning the "sweeper":

    if (db->db_sw_faster)
        xt_wakeup_sweeper(db);

    /* Don't get too far ahead of the sweeper! */
    if (writer) {
        ...

Can you help suggest whether these should be done in the "fast" part or in the "slow" part?

Also, this statement definitely needs to be postponed to the "slow" part, I guess:

    thread->st_xact_data = NULL;
Then, when the actual commit is called, we check the current log flush position against the flush position we need. If it is past our position, then this is a NOP.
I think I can do this with a condition like this: if (xt_comp_log_pos(self->commit_fastpart_log_id, self->commit_fastpart_log_offset, xl_flush_log_id, xl_flush_log_offset) <= 0) But I am wondering if I need to take any locks around reading xl_flush_log_id and xl_flush_log_offset? Or can one argue that a dirty read could be ok (as long as it's atomic) as the values are probably monotonic?
If not, then we need to call xlog_append() with no data. This will do a group commit on the log.
Is it safe to call xlog_append() with no data even if the log has been flushed past the current position already? (else some locking seems definitely needed).
It was a bit difficult to explain, so please ask if anything is not clear.
Hopefully you can help with some of the above points; then I can give it another go with fresh eyes and maybe show you a patch. (If I get to that point, I will probably also need some advice on the proper error handling...)

Anyway, from what you wrote and from what I see in the code, it seems the API I propose is general enough to fit well with PBXT, which is good and what I wanted to check (even if xn_end_xact() may need to be taken apart a bit to properly split it into a "fast" and a "slow" part).

 - Kristian.
On Oct 5, 2010, at 3:10 PM, Kristian Nielsen wrote:
Paul McCullagh <paul.mccullagh@primebase.org> writes:
The easiest way to do this would be to add a parameter to xn_end_xact() that indicates that the log should not be written or flushed.
Ok, I gave it a shot, but I had some problems due to not knowing the PBXT code sufficiently ...
In that case, judging by your questions, you catch on quick! :)
In xn_end_xact(), the last parameter to the call to xt_xlog_log_data() determines what should happen:
#define XT_XLOG_NO_WRITE_NO_FLUSH   0
#define XT_XLOG_WRITE_AND_FLUSH     1
#define XT_XLOG_WRITE_AND_NO_FLUSH  2
Without write or flush, this is a very fast operation. But the transaction is still committed and ordered, it is just not durable.
I notice that xn_end_xact() does a number of things. I am wondering if all of these should be in the "fast" part in commit_ordered(), or if some should be done in the "slow" part along with the log flush?
In particular this, flushing the data log (is this flush to disk?):
Yes, this is a flush to disk. This could be done in the slow part (obviously this would be ideal). But then the following problem should be fixed.

If we write the transaction log (i.e. commit the transaction) without flushing it, it may still be flushed by some other thread later. This will make the commit durable (in other words, on recovery, this transaction will be rolled forward). If we do not flush the data log, then there is a chance that such a committed transaction is incomplete, because the associated data log data has not been committed.

The way to fix this problem is to check the extent of flushing of both the data log and the transaction log on recovery. Simply put, on recovery we check whether the data log part of each record is completely flushed (is within the flush zone of the data log). If a data log record is missing, then recovery stops at that point in the transaction log.

This will have to be built into the engine. And it is easiest to do this in PBXT 1.5, which handles transaction logs and data logs identically.
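The recovery rule described here can be sketched as a small function. The names (XlogRec, dlog_end, recoverable_prefix) are hypothetical, not the actual PBXT recovery code; the point is only the stopping condition: roll the transaction log forward until a record refers to data-log data beyond the flushed extent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of the recovery rule (names are invented, not PBXT's).
struct XlogRec { unsigned long dlog_end; };  // data-log offset this record needs

// Returns how many transaction-log records can be safely rolled forward,
// given how far the data log was flushed before the crash.
inline std::size_t recoverable_prefix(const std::vector<XlogRec>& xlog,
                                      unsigned long dlog_flushed_to) {
    std::size_t n = 0;
    for (const XlogRec& r : xlog) {
        if (r.dlog_end > dlog_flushed_to)
            break;  // associated data-log data missing: stop recovery here
        ++n;
    }
    return n;
}
```

Everything before the stopping point is rolled forward as usual; everything after it is treated as if it had never been committed.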
if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
    ok = FALSE;
    status = XT_LOG_ENT_ABORT;
}
and this, at the end concerning the "sweeper":
if (db->db_sw_faster)
    xt_wakeup_sweeper(db);
Yes, this could be taken out of the fast part, although it is not called all that often.
/* Don't get too far ahead of the sweeper! */
if (writer) {
    ...
Can you help suggest whether these should be done in the "fast" part or in the "slow" part?
Also, this statement definitely needs to be postponed to the "slow" part I guess:
thread->st_xact_data = NULL;
Actually, I don't think so. As far as PBXT is concerned, after the fast part has run, the transaction is committed. It is just not durable. This means that anything we do in the slow part should not need an explicit reference to the transaction.
Then, when the actual commit is called, we check the current log flush position against the flush position we need. If it is past our position, then this is a NOP.
I think I can do this with a condition like this:
if (xt_comp_log_pos(self->commit_fastpart_log_id, self->commit_fastpart_log_offset, xl_flush_log_id, xl_flush_log_offset) <= 0)
Yes!
But I am wondering if I need to take any locks around reading xl_flush_log_id and xl_flush_log_offset? Or can one argue that a dirty read could be ok (as long as it's atomic) as the values are probably monotonic?
Basically yes. I believe I do this without a lock elsewhere, and have taken care that this works. The flush log position is always increasing. Critical is when we switch logs, e.g. from log_id=100, log_offset=80000 to log_id=101, log_offset=0. I believe when this is done, the log_offset is first set to zero, then the log_id is incremented (should check this). This would mean that the comparison function would err on the side of flushing unnecessarily if the check comes between the two operations.
If not, then we need to call xlog_append() with no data. This will do a group commit on the log.
Is it safe to call xlog_append() with no data even if the log has been flushed past the current position already? (else some locking seems definitely needed).
Yes, it is safe. If there is nothing to do, xlog_append() will just return.
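The NOP check plus the group flush discussed above might look roughly like this. All names here (LogPos, MiniLog, group_flush, slow_commit) are illustrative stand-ins, not PBXT's actual xlog API:

```cpp
// Hypothetical sketch of the "slow" commit part; not the real PBXT code.
struct LogPos { unsigned long log_id; unsigned long offset; };

inline int cmp_log_pos(LogPos a, LogPos b) {
    if (a.log_id != b.log_id) return a.log_id < b.log_id ? -1 : 1;
    if (a.offset != b.offset) return a.offset < b.offset ? -1 : 1;
    return 0;
}

struct MiniLog {
    LogPos flush_pos{0, 0};  // how far the log has been flushed to disk
    int flush_calls = 0;

    // Stand-in for xlog_append() with no data: flush up to 'target'.
    // A real group commit would batch all waiting threads behind one flush.
    void group_flush(LogPos target) {
        ++flush_calls;
        if (cmp_log_pos(flush_pos, target) < 0)
            flush_pos = target;
    }

    // Slow commit: a NOP if the log is already flushed past our commit record.
    void slow_commit(LogPos commit_end) {
        if (cmp_log_pos(commit_end, flush_pos) <= 0)
            return;  // some other thread already made us durable
        group_flush(commit_end);
    }
};
```

The second call in the usage below is the NOP case: another transaction's flush already covered our commit record, so no extra flush happens.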
It was a bit difficult to explain, so please ask if anything is not clear.
Hopefully you can help with some of the above points, then I can give it another go with fresh eyes and maybe show you a patch.
(If I get to that point, I will probably also need some advice on the proper error handling)...
Yup, always the tricky part!
Anyway, from what you wrote and from what I see in the code, it seems the API I propose is general enough to fit well with PBXT, which is good and what I wanted to check (Even if xn_end_xact() may need to be taken apart a bit to properly split into a "fast" and a "slow" part).
I would actually recommend a "lazy" approach to the implementation. Simply add a boolean to the current commit, which indicates that a fast commit should be done. Then we add a new "slow commit" function which does the parts not done by the fast commit.

--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com
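The lazy split can be sketched as a small state machine; the identifiers below are invented for illustration (the actual patch later in this thread uses names like st_commit_ordered and xt_xn_commit_fast/xt_xn_commit_slow):

```cpp
#include <cassert>

// Illustrative sketch of the "lazy" fast/slow commit split (not PBXT code).
struct MiniTxn {
    bool commit_ordered_done = false;
    bool durable = false;
    unsigned long commit_pos = 0;
};

struct MiniEngine2 {
    unsigned long log_end = 0;      // end of the in-memory log
    unsigned long log_flushed = 0;  // how far the log is flushed to disk

    // Fast commit: write the commit record to the in-memory log only.
    void fast_commit(MiniTxn& t) {
        t.commit_pos = ++log_end;      // transaction is committed and ordered...
        t.commit_ordered_done = true;  // ...but not yet durable
    }

    // The engine's normal commit(): finish whichever part remains.
    void commit(MiniTxn& t) {
        if (!t.commit_ordered_done)
            fast_commit(t);            // no prior fast commit: do both parts
        if (log_flushed < t.commit_pos)
            log_flushed = log_end;     // slow part: flush (a group commit)
        t.durable = true;
        t.commit_ordered_done = false;
    }
};
```

Note how committing the second transaction flushes the log past the first one's record, so the first transaction's slow part becomes a NOP.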
Paul McCullagh <paul.mccullagh@primebase.org> writes:
In particular this, flushing the data log (is this flush to disk?):
if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
    ok = FALSE;
    status = XT_LOG_ENT_ABORT;
}
Yes, this is a flush to disk.
This could be done in the slow part (obviously this would be ideal).
It occurred to me that since we only do this (the new commit_ordered() API call) after having successfully run prepare(), the data log will already have been flushed to disk, right? So I suppose in this case the data log flush will be a no-op, in which case it is no problem to leave it in the "fast" part, or we could skip calling it.
But there is the following problem that should then be fixed.
If we write the transaction log (i.e. commit the transaction) without flushing it, it may still be flushed by some other thread later. This will make the commit durable (in other words, on recovery, this transaction will be rolled forward).
If we do not flush the data log, then there is a chance that such a committed transaction is incomplete, because the associated data log data has not been committed.
The way to fix this problem is to check the extent of flushing of both the data log and the transaction log on recovery. Simply put, on recovery we check whether the data log part of each record is completely flushed (is within the flush zone of the data log).
If a data log record is missing, then recovery stops at that point in the transaction log.
Yes, I see, thanks for the explanation.
This will have to be built into the engine. And it is easiest to do this in PBXT 1.5, which handles transaction logs and data logs identically.
Ok. Well, maybe it's not necessary, given the above observation.
and this, at the end concerning the "sweeper":
if (db->db_sw_faster)
    xt_wakeup_sweeper(db);
Yes, this could be taken out of the fast part, although it is not called all that often.
Ok, I will omit it.
Also, this statement definitely needs to be postponed to the "slow" part I guess:
thread->st_xact_data = NULL;
Actually, I don't think so. As far as PBXT is concerned, after the fast part has run, the transaction is committed. It is just not durable.
This means that anything we do in the slow part should not need an explicit reference to the transaction.
Right, I see what you mean, I will keep it.
The flush log position is always increasing. Critical is when we switch logs, e.g. from log_id=100, log_offset=80000, to log_id=101, log_offset=0.
I believe when this is done, the log_offset is first set to zero, then the log_id is incremented (should check this).
This would mean that the comparison function would err on the side of flushing unnecessarily if the check comes between the two operations.
Yes, that should work. However, you need a write memory barrier when you update the position, and a read memory barrier when you read it:

    xl_flush_log_offset = 0;
    wmb();
    xl_flush_log_id = old_id + 1;

    ...

    local_id = xl_flush_log_id;
    rmb();
    local_offset = xl_flush_log_offset;

Without this, the CPU may do the reads or writes in the opposite order (or, more likely, the newest optimisations in GCC will).
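The same publish/read protocol can be expressed with C++11 atomics, where a release store plays the role of wmb() and an acquire load plays the role of rmb(). This is an illustrative sketch (FlushPos and switch_log are invented names), not the PBXT code:

```cpp
#include <atomic>
#include <cassert>

// Sketch of the barrier protocol using C++11 atomics (illustrative only).
struct FlushPos {
    std::atomic<unsigned long> log_id{100};
    std::atomic<unsigned long> log_offset{80000};

    // Writer: store the offset first, then publish the new log_id with
    // release ordering, so the offset store cannot be reordered after it.
    void switch_log() {
        log_offset.store(0, std::memory_order_relaxed);
        log_id.store(log_id.load(std::memory_order_relaxed) + 1,
                     std::memory_order_release);
    }

    // Reader: load log_id first with acquire ordering, then the offset.
    // If the reader sees the new log_id, it is guaranteed to see offset 0;
    // if it sees the old log_id, the comparison merely errs on the side of
    // flushing unnecessarily, as discussed above.
    void read(unsigned long& id, unsigned long& off) {
        id = log_id.load(std::memory_order_acquire);
        off = log_offset.load(std::memory_order_relaxed);
    }
};
```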
I would actually recommend a "lazy" approach to the implementation.
Simply add a boolean to the current commit, which indicates a fast commit should be done.
Then we add a new "slow commit" function which does the parts not done by the fast commit.
Ok, thanks a lot for the advice, I will give it another shot. - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Ok, thanks a lot for the advice, I will give it another shot.
Thanks to your help, I got it working! It was _really_ nice to see that the new API applies well to PBXT also.

As a bonus, we now get START TRANSACTION WITH CONSISTENT SNAPSHOT to actually be consistent! In MySQL, this does not really do much except start a transaction in all engines; it certainly does not ensure any consistency between engines. With this change, it becomes consistent. I added a small Perl test program tests/consistent_snapshot.pl that shows this. I think this is particularly useful for backups; I plan to add a way to get the corresponding binlog position, so START TRANSACTION WITH CONSISTENT SNAPSHOT can be used to make a fully consistent and non-blocking backup (current mysqldump needs FLUSH TABLES WITH READ LOCK, which is not really non-blocking).

I hope you can take a look at the patch (attached) when you get some time and let me know what you think, and if you see any mistakes. I did it a little differently from what we discussed, as I wanted to minimise the amount of work done while holding the global mutex around commit_ordered().

I also pushed the patch here, in case you want to see or run the full code:

    lp:~maria-captains/maria/mariadb-5.1-mwl116-pbxt

It passes the test suite, but I did at one point see this in the log, which I am not sure what it means; maybe you can help?

    void XTTabCache::xt_tc_release_page(XTOpenFile*, XTTabCachePage*, XTThread*)(tabcache_xt.cc:409) page->tcp_lock_count > 0

Finally, a couple of questions:
In particular this, flushing the data log (is this flush to disk?):
if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) { ok = FALSE; status = XT_LOG_ENT_ABORT; }
Yes, this is a flush to disk.
This could be done in the slow part (obviously this would be ideal).
If we do not flush the data log, then there is a chance that such a commit transaction is incomplete, because the associated data log data has not been committed.
This is done in commit, but I could not see where a similar data log flush is done in prepare(). It seems prepare() mostly adds a "prepare" record and flushes the transaction log. Is it correct that no data log flush happens in prepare? If so, don't we have the same problem?

Suppose we prepare() in PBXT and write (and flush) the transaction into the binary log. Then we crash. When the server comes back up, it will try to recover the transaction inside PBXT, but that will not be possible if the data log was lost due to no flush, right?

Final question:

In commit() we call xt_tab_restrict_rows(). It seems to be delayed checking for deferred foreign key constraints or something like that? If it is, then shouldn't it be done in prepare() (it's wrong to roll back with an error in commit() after a successful prepare)? I see the #ifdef XT_IMPLEMENT_NO_ACTION around the call, so I suppose this code is not actually used, but I just wondered ...

 - Kristian.
Hi Kristian,

On Oct 15, 2010, at 4:07 PM, Kristian Nielsen wrote:

> Kristian Nielsen <knielsen@knielsen-hq.org> writes:
>
>> Ok, thanks a lot for the advice, I will give it another shot.
>
> Thanks to your help, I got it working! It was _really_ nice to see that the
> new API applies well to PBXT also.

Wow! That's great.

> As a bonus, we now get START TRANSACTION WITH CONSISTENT SNAPSHOT to actually
> be consistent! In MySQL, this does not really do much except start a
> transaction in all engines; it certainly does not ensure any consistency
> between engines. With this change, it becomes consistent. I added a small
> Perl test program tests/consistent_snapshot.pl that shows this. I think this
> is particularly useful for backups;

That's cool.

> I plan to add a way to get the corresponding binlog position, so START
> TRANSACTION WITH CONSISTENT SNAPSHOT can be used to make a fully consistent
> and non-blocking backup (current mysqldump needs FLUSH TABLES WITH READ LOCK,
> which is not really non-blocking).
>
> I hope you can take a look at the patch (attached) when you get some time and
> let me know what you think, and if you see any mistakes. I did it a little
> differently from what we discussed, as I wanted to minimise the amount of
> work done while holding the global mutex around commit_ordered().

OK, I will check it out when I have time.

> I also pushed the patch here, in case you want to see or run the full code:
>
>     lp:~maria-captains/maria/mariadb-5.1-mwl116-pbxt
>
> It passes the test suite, but I did at one point see this in the log, which I
> am not sure what it means; maybe you can help?
>
>     void XTTabCache::xt_tc_release_page(XTOpenFile*, XTTabCachePage*,
>     XTThread*)(tabcache_xt.cc:409) page->tcp_lock_count > 0

Hmmm. Not so good.
> Finally, a couple of questions:
>
>>>> In particular this, flushing the data log (is this flush to disk?):
>>>>
>>>> if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
>>>>     ok = FALSE;
>>>>     status = XT_LOG_ENT_ABORT;
>>>> }
>>>
>>> Yes, this is a flush to disk.
>>>
>>> This could be done in the slow part (obviously this would be ideal).
>>>
>>> If we do not flush the data log, then there is a chance that such a
>>> committed transaction is incomplete, because the associated data log data
>>> has not been committed.
>
> This is done in commit, but I could not see where a similar data log flush is
> done in prepare(). It seems prepare() mostly adds a "prepare" record and
> flushes the transaction log.

Yes, this is all it does.

> Is it correct that no data log flush happens in prepare? If so, don't we have
> the same problem?

Oops, that looks like a bug... Prepare should also flush the data log. Well done! :)

> Suppose we prepare() in PBXT and write (and flush) the transaction into the
> binary log. Then we crash. When the server comes back up, it will try to
> recover the transaction inside PBXT, but that will not be possible if the
> data log was lost due to no flush, right?

Rollback would be possible, but commit may not be possible. Right.

> Final question:
>
> In commit() we call xt_tab_restrict_rows(). It seems to be delayed checking
> for deferred foreign key constraints or something like that? If it is, then
> shouldn't it be done in prepare() (it's wrong to roll back with an error in
> commit() after a successful prepare)? I see the #ifdef XT_IMPLEMENT_NO_ACTION
> around the call, so I suppose this code is not actually used, but I just
> wondered ...

Yup. Right again, on all counts! :) There was/is a bug in MySQL that prevents me from activating this code.
Best regards,

Paul

> ------------------------------------------------------------
> revno: 2852
> committer: knielsen@knielsen-hq.org
> branch nick: work-5.1-pbxt-commit-ordered
> timestamp: Fri 2010-10-15 15:42:06 +0200
> message:
>   MWL#116: Efficient group commit: PBXT part
>
>   Implement the commit_ordered() API in PBXT, getting consistent commit
>   ordering with other engines and binlog.
>
>   Make pbxt_support_xa default in MariaDB debug build (as the bug that
>   causes assert in MySQL is fixed in MariaDB).
> diff:
> === modified file 'storage/pbxt/src/ha_pbxt.cc'
> --- storage/pbxt/src/ha_pbxt.cc 2010-09-28 13:05:45 +0000
> +++ storage/pbxt/src/ha_pbxt.cc 2010-10-15 13:42:06 +0000
> @@ -108,6 +108,9 @@
>  static int pbxt_panic(handlerton *hton, enum ha_panic_function flag);
>  static void pbxt_drop_database(handlerton *hton, char *path);
>  static int pbxt_close_connection(handlerton *hton, THD* thd);
> +#ifdef MARIADB_BASE_VERSION
> +static void pbxt_commit_ordered(handlerton *hton, THD *thd, bool all);
> +#endif
>  static int pbxt_commit(handlerton *hton, THD *thd, bool all);
>  static int pbxt_rollback(handlerton *hton, THD *thd, bool all);
>  static int pbxt_prepare(handlerton *hton, THD *thd, bool all);
> @@ -1147,6 +1150,9 @@
>      pbxt_hton->state = SHOW_OPTION_YES;
>      pbxt_hton->db_type = DB_TYPE_PBXT; // Wow! I have my own!
>      pbxt_hton->close_connection = pbxt_close_connection; /* close_connection, cleanup thread related data. */
> +#ifdef MARIADB_BASE_VERSION
> +    pbxt_hton->commit_ordered = pbxt_commit_ordered;
> +#endif
>      pbxt_hton->commit = pbxt_commit; /* commit */
>      pbxt_hton->rollback = pbxt_rollback; /* rollback */
>      if (pbxt_support_xa) {
> @@ -1484,6 +1490,29 @@
>      return err;
>  }
>
> +#ifdef MARIADB_BASE_VERSION
> +/*
> + * Quickly commit the transaction to memory and make it visible to others.
> + * The remaining part of commit will happen later, in pbxt_commit().
> + */
> +static void pbxt_commit_ordered(handlerton *hton, THD *thd, bool all)
> +{
> +    XTThreadPtr self;
> +
> +    if ((self = (XTThreadPtr) *thd_ha_data(thd, hton))) {
> +        XT_PRINT2(self, "%s pbxt_commit_ordered all=%d\n", all ? "END CONN XACT" : "END STAT", all);
> +
> +        if (self->st_xact_data) {
> +            if (all || self->st_auto_commit) {
> +                self->st_commit_ordered = TRUE;
> +                self->st_writer = self->st_xact_writer;
> +                self->st_delayed_error= !xt_xn_commit_fast(self, self->st_writer);
> +            }
> +        }
> +    }
> +}
> +#endif
> +
>  /*
>   * Commit the PBXT transaction of the given thread.
>   * thd is the MySQL thread structure.
> @@ -1512,7 +1541,13 @@
>          if (all || self->st_auto_commit) {
>              XT_PRINT0(self, "xt_xn_commit in pbxt_commit\n");
>
> -            if (!xt_xn_commit(self))
> +            if (self->st_commit_ordered) {
> +                self->st_commit_ordered = FALSE;
> +                err = !xt_xn_commit_slow(self, self->st_writer) || self->st_delayed_error;
> +            } else {
> +                err = !xt_xn_commit(self);
> +            }
> +            if (err)
>                  err = xt_ha_pbxt_thread_error_for_mysql(thd, self, FALSE);
>          }
>      }
> @@ -6064,7 +6099,7 @@
>          NULL, NULL, 0, 0, 20000, 1);
>  #endif
>
> -#ifndef DEBUG
> +#if !defined(DEBUG) || defined(MARIADB_BASE_VERSION)
>  static MYSQL_SYSVAR_BOOL(support_xa, pbxt_support_xa,
>      PLUGIN_VAR_OPCMDARG,
>      "Enable PBXT support for the XA two-phase commit, default is enabled",
>
> === modified file 'storage/pbxt/src/thread_xt.h'
> --- storage/pbxt/src/thread_xt.h 2010-05-05 10:59:57 +0000
> +++ storage/pbxt/src/thread_xt.h 2010-10-15 13:42:06 +0000
> @@ -299,6 +299,9 @@
>      xtBool st_stat_ended;     /* TRUE if the statement was ended. */
>      xtBool st_stat_trans;     /* TRUE if a statement transaction is running (started on UPDATE). */
>      xtBool st_stat_modify;    /* TRUE if the statement is an INSERT/UPDATE/DELETE */
> +    xtBool st_commit_ordered; /* TRUE if we have run commit_ordered() */
> +    xtBool st_delayed_error;  /* TRUE if we got an error in commit_ordered() */
> +    xtBool st_writer;         /* Copy of thread->st_xact_writer (which is clobbered by xlog_append()) */
>  #ifdef XT_IMPLEMENT_NO_ACTION
>      XTBasicListRec st_restrict_list; /* These records have been deleted and should have no reference. */
>  #endif
>
> === modified file 'storage/pbxt/src/xaction_xt.cc'
> --- storage/pbxt/src/xaction_xt.cc 2010-09-28 13:05:45 +0000
> +++ storage/pbxt/src/xaction_xt.cc 2010-10-15 13:42:06 +0000
> @@ -1287,27 +1287,61 @@
>      return OK;
>  }
>
> -static xtBool xn_end_xact(XTThreadPtr thread, u_int status)
> +static void xn_end_release_locks(XTThreadPtr thread)
> +{
> +    XTXactDataPtr xact = thread->st_xact_data;
> +    XTDatabaseHPtr db = thread->st_database;
> +    ASSERT_NS(xact);
> +
> +    /* {REMOVE-LOCKS} Drop locks if you have any: */
> +    thread->st_lock_list.xt_remove_all_locks(db, thread);
> +
> +    /* Do this afterwards to make sure the sweeper
> +     * does not cleanup transactions start cleaning up
> +     * before any transactions that were waiting for
> +     * this transaction have completed!
> +     */
> +    xact->xd_end_xn_id = db->db_xn_curr_id;
> +
> +    /* Now you can sweep! */
> +    xact->xd_flags |= XT_XN_XAC_SWEEP;
> +}
> +
> +/* The commit is split into two phases: one "fast" for MariaDB commit_ordered(),
> + * and one "slow" for commit(). When not using internal 2pc, there is only one
> + * call combining both phases.
> + */
> +
> +enum {
> +    XN_END_PHASE_FAST = 1,
> +    XN_END_PHASE_SLOW = 2,
> +    XN_END_PHASE_BOTH = 3
> +};
> +
> +static xtBool xn_end_xact(XTThreadPtr thread, u_int status, xtBool writer, int phase)
>  {
>      XTXactDataPtr xact;
>      xtBool ok = TRUE;
> +    xtBool err;
>
>      ASSERT_NS(thread->st_xact_data);
>      if ((xact = thread->st_xact_data)) {
>          XTDatabaseHPtr db = thread->st_database;
>          xtXactID xn_id = xact->xd_start_xn_id;
> -        xtBool writer;
>
> -        if ((writer = thread->st_xact_writer)) {
> +        if (writer) {
>              /* The transaction wrote something: */
>              XTXactEndEntryDRec entry;
>              xtWord4 sum;
>
> -            sum = XT_CHECKSUM4_XACT(xn_id) ^ XT_CHECKSUM4_XACT(0);
> -            entry.xe_status_1 = status;
> -            entry.xe_checksum_1 = XT_CHECKSUM_1(sum);
> -            XT_SET_DISK_4(entry.xe_xact_id_4, xn_id);
> -            XT_SET_DISK_4(entry.xe_not_used_4, 0);
> +            if (phase & XN_END_PHASE_FAST)
> +            {
> +                sum = XT_CHECKSUM4_XACT(xn_id) ^ XT_CHECKSUM4_XACT(0);
> +                entry.xe_status_1 = status;
> +                entry.xe_checksum_1 = XT_CHECKSUM_1(sum);
> +                XT_SET_DISK_4(entry.xe_xact_id_4, xn_id);
> +                XT_SET_DISK_4(entry.xe_not_used_4, 0);
> +            }
>
>  #ifdef XT_IMPLEMENT_NO_ACTION
>              /* This will check any resticts that have been delayed to the end of the statement. */
> @@ -1319,20 +1353,35 @@
>              }
>  #endif
>
> -            /* Flush the data log: */
> -            if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
> +            /* Flush the data log (in the "fast" case we already did it in prepare: */
> +            if ((phase & XN_END_PHASE_SLOW) && !thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
>                  ok = FALSE;
>                  status = XT_LOG_ENT_ABORT;
>              }
>
>              /* Write and flush the transaction log: */
> -            if (!xt_xlog_log_data(thread, sizeof(XTXactEndEntryDRec), (XTXactLogBufferDPtr) &entry, xt_db_flush_log_at_trx_commit)) {
> +            if (phase == XN_END_PHASE_FAST) {
> +                /* Fast phase, delay any write or flush to later.
*/ > + err = !xt_xlog_log_data(thread, sizeof(XTXactEndEntryDRec), > (XTXactLogBufferDPtr) &entry, XT_XLOG_NO_WRITE_NO_FLUSH); > + } else if (phase == XN_END_PHASE_SLOW) { > + /* We already appended the commit record in the fast phase. > + * Now just call with empty record to ensure we write/flush > + * the log as needed for this commit. > + */ > + err = !xt_xlog_log_data(thread, 0, NULL, > xt_db_flush_log_at_trx_commit); > + } else /* phase == XN_END_PHASE_BOTH */ { > + /* Both phases at once, append commit record and write/flush > normally. */ > + ASSERT_NS(phase == XN_END_PHASE_BOTH); > + err = !xt_xlog_log_data(thread, sizeof(XTXactEndEntryDRec), > (XTXactLogBufferDPtr) &entry, xt_db_flush_log_at_trx_commit); > + } > + > + if (err) { > ok = FALSE; > status = XT_LOG_ENT_ABORT; > /* Make sure this is done, if we failed to log > * the transction end! > */ > - if (thread->st_xact_writer) { > + if (writer) { > /* Adjust this in case of error, but don't forget > * to lock! > */ > @@ -1347,46 +1396,46 @@ > } > } > > - /* Setting this flag completes the transaction, > - * Do this before we release the locks, because > - * the unlocked transactions expect the > - * transaction they are waiting for to be > - * gone! > + if (phase & XN_END_PHASE_FAST) { > + /* Setting this flag completes the transaction, > + * Do this before we release the locks, because > + * the unlocked transactions expect the > + * transaction they are waiting for to be > + * gone! > + */ > + xact->xd_end_time = ++db->db_xn_end_time; > + if (status == XT_LOG_ENT_COMMIT) { > + thread->st_statistics.st_commits++; > + xact->xd_flags |= (XT_XN_XAC_COMMITTED | XT_XN_XAC_ENDED); > + } > + else { > + thread->st_statistics.st_rollbacks++; > + xact->xd_flags |= XT_XN_XAC_ENDED; > + } > + } > + > + /* Be as fast as possible in the "fast" path, as we want to be as > + * fast as possible here (we will release slow locks immediately > + * after in the "slow" part). 
> + * ToDo: If we ran the fast part, the slow part could release > locks > + * _before_ fsync(), rather than after. > */ > - xact->xd_end_time = ++db->db_xn_end_time; > - if (status == XT_LOG_ENT_COMMIT) { > - thread->st_statistics.st_commits++; > + if (!(phase & XN_END_PHASE_SLOW)) > + return ok; > + > + xn_end_release_locks(thread); > + } > + else { > + /* Read-only transaction can be removed, immediately */ > + if (phase & XN_END_PHASE_FAST) { > + xact->xd_end_time = ++db->db_xn_end_time; > xact->xd_flags |= (XT_XN_XAC_COMMITTED | XT_XN_XAC_ENDED); > - } > - else { > - thread->st_statistics.st_rollbacks++; > - xact->xd_flags |= XT_XN_XAC_ENDED; > - } > - > - /* {REMOVE-LOCKS} Drop locks is you have any: */ > - thread->st_lock_list.xt_remove_all_locks(db, thread); > - > - /* Do this afterwards to make sure the sweeper > - * does not cleanup transactions start cleaning up > - * before any transactions that were waiting for > - * this transaction have completed! > - */ > - xact->xd_end_xn_id = db->db_xn_curr_id; > - > - /* Now you can sweep! 
*/ > - xact->xd_flags |= XT_XN_XAC_SWEEP; > - } > - else { > - /* Read-only transaction can be removed, immediately */ > - xact->xd_end_time = ++db->db_xn_end_time; > - xact->xd_flags |= (XT_XN_XAC_COMMITTED | XT_XN_XAC_ENDED); > - > - /* Drop locks is you have any: */ > - thread->st_lock_list.xt_remove_all_locks(db, thread); > - > - xact->xd_end_xn_id = db->db_xn_curr_id; > - > - xact->xd_flags |= XT_XN_XAC_SWEEP; > + > + if (!(phase & XN_END_PHASE_SLOW)) > + return ok; > + } > + > + xn_end_release_locks(thread); > > if (xt_xn_delete_xact(db, xn_id, thread)) { > if (db->db_xn_min_ram_id == xn_id) > @@ -1478,12 +1527,22 @@ > > xtPublic xtBool xt_xn_commit(XTThreadPtr thread) > { > - return xn_end_xact(thread, XT_LOG_ENT_COMMIT); > + return xn_end_xact(thread, XT_LOG_ENT_COMMIT, thread- > >st_xact_writer, XN_END_PHASE_BOTH); > +} > + > +xtPublic xtBool xt_xn_commit_fast(XTThreadPtr thread, xtBool writer) > +{ > + return xn_end_xact(thread, XT_LOG_ENT_COMMIT, writer, > XN_END_PHASE_FAST); > +} > + > +xtPublic xtBool xt_xn_commit_slow(XTThreadPtr thread, xtBool writer) > +{ > + return xn_end_xact(thread, XT_LOG_ENT_COMMIT, writer, > XN_END_PHASE_SLOW); > } > > xtPublic xtBool xt_xn_rollback(XTThreadPtr thread) > { > - return xn_end_xact(thread, XT_LOG_ENT_ABORT); > + return xn_end_xact(thread, XT_LOG_ENT_ABORT, thread- > >st_xact_writer, XN_END_PHASE_BOTH); > } > > xtPublic xtBool xt_xn_log_tab_id(XTThreadPtr self, xtTableID tab_id) > > === modified file 'storage/pbxt/src/xaction_xt.h' > --- storage/pbxt/src/xaction_xt.h 2010-05-05 10:59:57 +0000 > +++ storage/pbxt/src/xaction_xt.h 2010-10-15 13:42:06 +0000 > @@ -193,6 +193,8 @@ > > xtBool xt_xn_begin(struct XTThread *self); > xtBool xt_xn_commit(struct XTThread *self); > +xtBool xt_xn_commit_fast(struct XTThread *self, xtBool writer); > +xtBool xt_xn_commit_slow(struct XTThread *self, xtBool writer); > xtBool xt_xn_rollback(struct XTThread *self); > xtBool xt_xn_log_tab_id(struct XTThread *self, xtTableID tab_id); > 
int xt_xn_status(struct XTOpenTable *ot, xtXactID xn_id, > xtRecordID rec_id); > > === added file 'tests/consistent_snapshot.pl' > --- tests/consistent_snapshot.pl 1970-01-01 00:00:00 +0000 > +++ tests/consistent_snapshot.pl 2010-10-15 13:42:06 +0000 > @@ -0,0 +1,107 @@ > +#! /usr/bin/perl > + > +# Test START TRANSACTION WITH CONSISTENT SNAPSHOT. > +# With MWL#116, this is implemented so it is actually consistent. > + > +use strict; > +use warnings; > + > +use DBI; > + > +my $UPDATERS= 10; > +my $READERS= 5; > + > +my $ROWS= 50; > +my $DURATION= 20; > + > +my $stop_time= time() + $DURATION; > + > +sub my_connect { > + my $dbh= DBI->connect("dbi:mysql:mysql_socket=/tmp/ > mysql.sock;database=test", > + "root", undef, { RaiseError=>1, > PrintError=>0, AutoCommit=>0}); > + $dbh->do("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE > READ"); > + $dbh->do("SET SESSION autocommit = 0"); > + return $dbh; > +} > + > +sub my_setup { > + my $dbh= my_connect(); > + > + $dbh->do("DROP TABLE IF EXISTS test_consistent_snapshot1, > test_consistent_snapshot2"); > + $dbh->do(<<TABLE); > +CREATE TABLE test_consistent_snapshot1 ( > + a INT PRIMARY KEY, > + b INT NOT NULL > +) ENGINE=InnoDB > +TABLE > + $dbh->do(<<TABLE); > +CREATE TABLE test_consistent_snapshot2( > + a INT PRIMARY KEY, > + b INT NOT NULL > +) ENGINE=PBXT > +TABLE > + > + for (my $i= 0; $i < $ROWS; $i++) { > + my $value= int(rand()*1000); > + $dbh->do("INSERT INTO test_consistent_snapshot1 VALUES (?, ?)", > undef, > + $i, $value); > + $dbh->do("INSERT INTO test_consistent_snapshot2 VALUES (?, ?)", > undef, > + $i, -$value); > + } > + $dbh->commit(); > + $dbh->disconnect(); > +} > + > +sub my_updater { > + my $dbh= my_connect(); > + > + while (time() < $stop_time) { > + my $i1= int(rand()*$ROWS); > + my $i2= int(rand()*$ROWS); > + my $v= int(rand()*99)-49; > + $dbh->do("UPDATE test_consistent_snapshot1 SET b = b + ? WHERE > a = ?", > + undef, $v, $i1); > + $dbh->do("UPDATE test_consistent_snapshot2 SET b = b - ? 
WHERE > a = ?", > + undef, $v, $i2); > + $dbh->commit(); > + } > + > + $dbh->disconnect(); > + exit(0); > +} > + > +sub my_reader { > + my $dbh= my_connect(); > + > + my $iteration= 0; > + while (time() < $stop_time) { > + $dbh->do("START TRANSACTION WITH CONSISTENT SNAPSHOT"); > + my $s1= $dbh->selectrow_arrayref("SELECT SUM(b) FROM > test_consistent_snapshot1"); > + $s1= $s1->[0]; > + my $s2= $dbh->selectrow_arrayref("SELECT SUM(b) FROM > test_consistent_snapshot2"); > + $s2= $s2->[0]; > + $dbh->commit(); > + if ($s1 + $s2 != 0) { > + print STDERR "Found inconsistency, s1=$s1 s2=$s2 iteration= > $iteration\n"; > + last; > + } > + ++$iteration; > + } > + > + $dbh->disconnect(); > + exit(0); > +} > + > +my_setup(); > + > +for (1 .. $UPDATERS) { > + fork() || my_updater(); > +} > + > +for (1 .. $READERS) { > + fork() || my_reader(); > +} > + > +waitpid(-1, 0) for (1 .. ($UPDATERS + $READERS)); > + > +print "All checks done\n"; > _______________________________________________ > Mailing list: https://launchpad.net/~maria-developers > Post to : maria-developers@lists.launchpad.net > Unsubscribe : https://launchpad.net/~maria-developers > More help : https://help.launchpad.net/ListHelp -- Paul McCullagh PrimeBase Technologies www.primebase.org www.blobstreaming.org pbxt.blogspot.com
Paul McCullagh <paul.mccullagh@primebase.org> writes:

> On Oct 15, 2010, at 4:07 PM, Kristian Nielsen wrote:
>
>> Thanks to your help, I got it working! It was _really_ nice to see that
>> the new API applies well to PBXT also.
>
> Wow! That's great.
>
>> I hope you can take a look at the patch (attached) when you get some
>> time and let me know what you think, and if you see any mistakes. I did it a
>
> OK, I will check it out when I have time.
Any update on this? Are you still planning to look into it at some point? Anything I can do to help? - Kristian.
Hi Kristian,

Sorry I have not had time for this. December is busy. I will try to allocate a few days in January.

On Dec 7, 2010, at 10:00 AM, Kristian Nielsen wrote:
> Paul McCullagh <paul.mccullagh@primebase.org> writes:
>
>> On Oct 15, 2010, at 4:07 PM, Kristian Nielsen wrote:
>>
>>> Thanks to your help, I got it working! It was _really_ nice to see that
>>> the new API applies well to PBXT also.
>>
>> Wow! That's great.
>>
>>> I hope you can take a look at the patch (attached) when you get some
>>> time and let me know what you think, and if you see any mistakes. I did it a
>>
>> OK, I will check it out when I have time.
>
> Any update on this? Are you still planning to look into it at some point?
> Anything I can do to help?
>
>  - Kristian.
--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com