On Oct 5, 2010, at 3:10 PM, Kristian Nielsen wrote:
Paul McCullagh <paul.mccullagh@primebase.org> writes:
The easiest way to do this would be to add a parameter to xn_end_xact() that indicates that the log should not be written or flushed.
Ok, I gave it a shot, but I had some problems due to not knowing the PBXT code sufficiently ...
In that case, judging by your questions, you catch on quick! :)
In xn_end_xact(), the last parameter to the call to xt_xlog_log_data() determines what should happen:
#define XT_XLOG_NO_WRITE_NO_FLUSH 0 #define XT_XLOG_WRITE_AND_FLUSH 1 #define XT_XLOG_WRITE_AND_NO_FLUSH 2
Without write or flush, this is a very fast operation. But the transaction is still committed and ordered, it is just not durable.
I notice that xs_end_xact() does a number of things. I am wondering if all of these should be in the "fast" part in commit_ordered(), or if some should be done in the "slow" part along with the log flush?
In particular this, flushing the data log (is this flush to disk?):
Yes, this is a flush to disk. This could be done in the slow part (obviously this would be ideal). But there is the following problem that should then be fixed. If we write the transaction log (i.e. commit the transaction), even if we do not flush the transaction log. It may be flushed by some other thread later. This will make the commit durable (in other words, on recovery, this transaction will be rolled forward). If we do not flush the data log, then there is a chance that such a commit transaction is incomplete, because the associated data log data has not been committed. The way to fix this problem is to check the extend of flushing of both the data and the transaction log on recovery. Simply put, on recover we check if the data log part of each record is completely flushed (is within the flush zone of the data log). If a data log record is missing, then recovery stops at that point in the transaction log. This will have to be built into the engine. And, it is easiest to do this in PBXT 1.5 which handle transaction logs and data logs identically.
if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) { ok = FALSE; status = XT_LOG_ENT_ABORT; }
and this, at the end concerning the "sweeper":
if (db->db_sw_faster) xt_wakeup_sweeper(db);
Yes, this could be taken out of the fast part, although it is not called all that often.
/* Don't get too far ahead of the sweeper! */ if (writer) { ...
Can you help suggest if these should be done in the "fast" part, or in the "slow" part?
Also, this statement definitely needs to be postponed to the "slow" part I guess:
thread->st_xact_data = NULL;
Actually, I don't think so. As far as PBXT is concerned, after the fast part has run, the transaction is committed. It is just not durable. This means that anything we do in the slow part should not need an explicit reference to the transaction.
Then when actual commit is called, we check the current log flush position against the flush position we need. If it is passed our position then this is a NOP.
I think I can do this with a condition like this:
if (xt_comp_log_pos(self->commit_fastpart_log_id, self-
commit_fastpart_log_offset, xl_flush_log_id, xl_flush_log_offset) <= 0)
Yes!
But I am wondering if I need to take any locks around reading xl_flush_log_id and xl_flush_log_offset? Or can one argue that a dirty read could be ok (as long as it's atomic) as the values are probably monotonic?
Basically yes. I believe I do this without lock elsewhere, and have taken care that this works. The flush log position is always increasing. Critical is when we switch logs, e.g. from log_id=100, log_offset=80000, to log_id=101, log_offset=0. I believe when this is done, the log_offset is first set to zero, then the log_id is incremented (should check this). This would mean that the comparing function would err on the side of flushing unnecessarily if the check comes between the to operations.
If not, then we need to call xlog_append() with no data. This will do a group commit on the log.
Is it safe to call xlog_append() with no data even if the log has been flushed past the current position already? (else some locking seems definitely needed).
Yes, it is safe. If there is nothing to do, xlog_append() will just return.
I was a bit difficult to explain, so please ask if anything is not clear.
Hopefully you can help with some of the above points, then I can give it another go with fresh eyes and maybe show you a patch.
(If I get to that point, I will probably also need some advice on the proper error handling)...
Yup, always the tricky part!
Anyway, from what you wrote and from what I see in the code, it seems the API I propose is general enough to fit well with PBXT, which is good and what I wanted to check (Even if xn_end_xact() may need to be taken apart a bit to properly split into a "fast" and a "slow" part).
I would actually recommend a "lazy" approach to the implementation. Simply add a boolean to the current commit, which indicates a fast commit should be done. Then we add a new "slow commit" function which does the parts not done by the slow commit. -- Paul McCullagh PrimeBase Technologies www.primebase.org www.blobstreaming.org pbxt.blogspot.com