Paul McCullagh <paul.mccullagh@primebase.org> writes:
In particular this, flushing the data log (is this flush to disk?):
if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) { ok = FALSE; status = XT_LOG_ENT_ABORT; }
Yes, this is a flush to disk.
This could be done in the slow part (obviously this would be ideal).
It occurred to me that since we only do this (the new commit_ordered() API call) after having successfully run prepare(), the data log will already have been flushed to disk, right? So I suppose in this case, the data log flush will be a no-operation. In which case it is no problem to leave it in the "fast" part, or we could skip calling it.
But there is the following problem that should then be fixed.
If we write the transaction log (i.e. commit the transaction) without flushing it, it may still be flushed by some other thread later. This will make the commit durable (in other words, on recovery, this transaction will be rolled forward).
If we do not flush the data log, then there is a chance that such a committed transaction is incomplete, because the associated data log data has not been flushed.
The way to fix this problem is to check the extent of flushing of both the data log and the transaction log on recovery. Simply put, on recovery we check if the data log part of each record is completely flushed (is within the flush zone of the data log).
If a data log record is missing, then recovery stops at that point in the transaction log.
Yes, I see, thanks for the explanation.
This will have to be built into the engine. And, it is easiest to do this in PBXT 1.5, which handles transaction logs and data logs identically.
Ok. Well, maybe it's not necessary as per above observation.
and this, at the end concerning the "sweeper":
if (db->db_sw_faster) xt_wakeup_sweeper(db);
Yes, this could be taken out of the fast part, although it is not called all that often.
Ok, I will omit it.
Also, this statement definitely needs to be postponed to the "slow" part I guess:
thread->st_xact_data = NULL;
Actually, I don't think so. As far as PBXT is concerned, after the fast part has run, the transaction is committed. It is just not durable.
This means that anything we do in the slow part should not need an explicit reference to the transaction.
Right, I see what you mean, I will keep it.
The flush log position is always increasing. Critical is when we switch logs, e.g. from log_id=100, log_offset=80000, to log_id=101, log_offset=0.
I believe when this is done, the log_offset is first set to zero, then the log_id is incremented (should check this).
This would mean that the comparison function would err on the side of flushing unnecessarily if the check falls between the two operations.
Yes, that should work. However, you need a write memory barrier when you update the position, and a read memory barrier when you read it:

    /* writer */
    xl_flush_log_offset = 0;
    wmb();
    xl_flush_log_id = old_id + 1;

    /* reader */
    local_id = xl_flush_log_id;
    rmb();
    local_offset = xl_flush_log_offset;

Without this, the CPU may do the reads or writes in the opposite order (or, more likely, the newest optimisations in GCC will do it).
I would actually recommend a "lazy" approach to the implementation.
Simply add a boolean to the current commit function, which indicates that a fast commit should be done.
Then we add a new "slow commit" function which does the parts not done by the fast commit.
Ok, thanks a lot for the advice, I will give it another shot. - Kristian.