I went through, in detail, the code that the InnoDB background threads use to pick and execute tasks. This is how they seem to work.
The reader/writer threads are each passed an integer that acts as a segment. They call fil_aio_wait(segment), which calls os_aio_handler(segment, &node, &message, &type). Control then passes to os_aio_simulated_handler(segment, m1, m2, request), where the code gets more complicated, with AIO arrays and slots, and it becomes harder to follow how the threads choose their tasks. It is definitely not a simple queue structure from which they pick their tasks. Also, which buffer pool a task relates to can only be figured out much later, from the value of m2, which stores the address of a bpage. A simple queue could easily have been replaced with multiple queues, i.e. one queue per NUMA node, like we once discussed on IRC. Lastly, all of these procedures are shared with the log threads as well.
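Just to make the idea concrete, here is a rough sketch of what I mean by a queue per NUMA node. It is only a sketch: io_task_t and numa_io_queues are made-up names, and the real code goes through the AIO arrays and slots instead, but this is the structure I had in mind:

    /* Hypothetical sketch, not InnoDB code: one mutex-protected FIFO per
       NUMA node, so an i/o thread bound to a node only ever pops requests
       for pages that live on that node. */
    #include <array>
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>

    struct io_task_t {          /* placeholder for one i/o request        */
        void*  node;            /* what m1 carries: the fil_node_t        */
        void*  bpage;           /* what m2 carries: the buf_page_t        */
    };

    constexpr std::size_t MAX_NUMA_NODES = 4;

    class numa_io_queues {
    public:
        void push(std::size_t numa_node, io_task_t task) {
            auto& q = queues_[numa_node % MAX_NUMA_NODES];
            {
                std::lock_guard<std::mutex> lock(q.mutex);
                q.tasks.push(task);
            }
            q.cond.notify_one();
        }

        io_task_t pop(std::size_t numa_node) {
            auto& q = queues_[numa_node % MAX_NUMA_NODES];
            std::unique_lock<std::mutex> lock(q.mutex);
            q.cond.wait(lock, [&q] { return !q.tasks.empty(); });
            io_task_t task = q.tasks.front();
            q.tasks.pop();
            return task;
        }

    private:
        struct per_node_queue {
            std::mutex              mutex;
            std::condition_variable cond;
            std::queue<io_task_t>   tasks;
        };
        std::array<per_node_queue, MAX_NUMA_NODES> queues_;
    };

The open question is still who pushes into these queues and which queue the non-background threads would use (more on that below).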
Another thing: you mentioned more than once that you wanted the reader threads to look for a bpage on their local nodes before looking it up on other nodes, but as I said, they operate on the bpage structure itself. Obviously neither of us had a proper understanding of how InnoDB works in these respects when we started the project. These threads mostly operate on bpages rather than on buf_pools, which makes NUMA mapping even harder (mapping buf_pools to NUMA nodes would be comparatively easy), but it is definitely more efficient than going through buf_pools and so shouldn't be changed.
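To show why I say the buf_pool-to-node direction is the easier one, here is a minimal sketch (hypothetical names, not InnoDB code; page_stub_t just stands in for buf_page_t): a pool instance maps to a node with a one-time table built at startup, while a bpage can only be mapped indirectly through the pool that owns it:

    #include <cstddef>
    #include <vector>

    struct page_stub_t {
        std::size_t buf_pool_index;   /* which pool instance owns the page */
    };

    class pool_node_map {
    public:
        pool_node_map(std::size_t n_pools, std::size_t n_nodes)
            : node_of_pool_(n_pools) {
            /* fill once at startup, e.g. round-robin the pool instances
               across the nodes -- the "easy" direction */
            for (std::size_t i = 0; i < n_pools; ++i) {
                node_of_pool_[i] = i % n_nodes;
            }
        }

        /* direct: pool instance -> node */
        std::size_t node_of_pool(std::size_t pool) const {
            return node_of_pool_[pool];
        }

        /* indirect: bpage -> owning pool -> node; since the i/o threads
           only see the bpage (via m2), this extra hop is unavoidable */
        std::size_t node_of_bpage(const page_stub_t& bpage) const {
            return node_of_pool_[bpage.buf_pool_index];
        }

    private:
        std::vector<std::size_t> node_of_pool_;
    };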
Then again, there were cases where the tasks assigned to the background threads were also performed by other server threads, especially in the case of flushing, but I managed to restrict those to their own nodes. If we were to create queues per node, which queues would these other threads work on?
In other words, the way InnoDB was originally written and then extended makes it significantly difficult to adapt for explicit NUMA support. It wouldn't just require restructuring and changing the way these threads use parameters, pick tasks, etc., but may also require reordering the sequence in which they are started. For example, the background threads are created before the user threads, and trying to use the NUMA node of a user THD later on would mean more system calls and probably thread migrations.
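To give an idea of the extra system calls I mean, this is roughly what a user THD would have to do at runtime just to find out which node it is on and bind to it. A sketch only: sched_getcpu(), numa_node_of_cpu() and numa_run_on_node() are the real glibc/libnuma calls (link with -lnuma), everything around them is made up:

    #include <numa.h>     /* numa_available, numa_node_of_cpu, numa_run_on_node */
    #include <sched.h>    /* sched_getcpu */
    #include <cstdio>

    static int current_numa_node() {
        if (numa_available() < 0) {
            return -1;                 /* kernel/libnuma has no NUMA support */
        }
        int cpu = sched_getcpu();      /* extra syscall per lookup           */
        if (cpu < 0) {
            return -1;
        }
        return numa_node_of_cpu(cpu);  /* cpu -> node lookup via libnuma     */
    }

    int main() {
        int node = current_numa_node();
        std::printf("running on NUMA node %d\n", node);
        if (node >= 0) {
            /* binding after the thread has already run somewhere else is
               what can cause the migration I mentioned */
            numa_run_on_node(node);
        }
        return 0;
    }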
When you added the task to the spreadsheet, you were right to anticipate that it could require a large cleanup of the InnoDB structure, but I am beginning to think it will be far more complicated than that. Also, as you once mentioned, most systems have 2 to 4 NUMA nodes. No doubt, if we implemented NUMA support such that InnoDB made the best use of it, there would be a performance difference, but it would still be quite small on a fast machine, and I really don't think making such big changes to InnoDB would be worth the effort and risk. Making a major change to working, verified code (written by someone else) is more error-prone than adding modular code and functions to support a new feature. I hope you agree.
Last but not least, if you think it can still be done and have an idea, I will be more than willing to attempt it. After all, GSoC has only got me started on contributing to open source, and this evaluation won't be the end of it. And since there's no other database software out there that supports NUMA (let's ignore numa-interleave), and since I was the first to start working on this, I would be really proud to see it through to the end.
Some evaluation stuff:
I see only one text field in the form for a link to the work done, and it takes a single URL. If I reserve this for the GitHub link, where do I add a link to the spreadsheet or the documentation?
Assuming you have already verified the work done so far, I will create a pull request shortly, so that I can state in the evaluation whether or not it was merged.