Hello Serg,
I am sorry to bug you with this mail but I am not sure what discussions you
and my mentors have had before I was failed in GSoC. I completed all the
tasks I had mentioned in my proposal but the project is incomplete. No
doubts about that, but it is so because I got stuck with a major task
(added in the last month) of the project for which I received no help from
my mentors. These are the conversations I had with my mentors recently
regarding the same task.
Thanks,
Sumit
---------- Forwarded message ----------
From: Sumit Lakra
Hello,
I went through the code that the InnoDB background threads use to pick tasks and execute them, in detail. This is how they seem to work.
The reader/writer threads are passed an integer which acts as a segment. They then call fil_aio_wait(segment), which calls os_aio_handler(segment, &node, &message, &type). Control then goes to os_aio_simulated_handler(segment, m1, m2, request), where the code gets more complicated with AIO arrays and slots. It gets harder to understand how they choose their tasks; it is definitely not a simple queue structure from which they pick them. Also, which buffer pool a task is related to can only be figured out quite late, based on the value of m2, which stores the address of a bpage.
A simple queue could easily have been replaced with multiple queues, i.e. a queue per NUMA node, like we once discussed on IRC.
Yes, you know where requests are created and where the job message ends up, so you should be able to create an implementation. Even if it doesn't preserve the existing behaviour it should be enough to test it.
Lastly, all these procedures are common for log threads as well.
relevance?
Another thing, you mentioned more than once you wanted the reader
SQL threads? Or is this the same thing?
threads to look for a bpage in their local nodes before looking them up in other nodes, but they use the bpage structure itself like I mentioned. Obviously neither of us had a proper understanding of how InnoDB worked in these aspects when we started the project.
No project knows all the details.
These threads seem to operate on bpages mostly rather than buf_pools, which makes numa mapping even harder
You could add a node_id to the class buf_page_t?
(mapping buf_pools to numa nodes would comparatively be easier), but operating on bpages is definitely more efficient than on buf_pools and hence shouldn't be changed.
Then again, there were cases when the tasks assigned to the background threads were done by other server threads as well, especially in case of flushing, but I successfully restricted them to their nodes. If we were to create queues per node, which queues would these other threads work on?
Yes, which is why this issue was raised with you months ago.
In other words the way in which InnoDB was initially written and expanded has made it significantly difficult to be adapted for explicit NUMA support.
The split of buffer pool instances was in appreciation that there are hot locks caused by a large number of threads. It just needed more thought than implementing a few thread bindings. I'm sorry you thought it was that easy. It wouldn't just require restructuring and changing the
way these threads use parameters, pick tasks etc,
You've changed very few parameters so far. https://github.com/MariaDB/server/compare/10.2...theGodlessLakra:numa_one but may also require
reordering the sequence in which they are executed.
I can't think of one example of this.
For example, the background threads are created before the user threads, and trying to use the NUMA node from user THD later on would mean more system calls and probably thread migrations.
Huh, what? "may..require..(bad design)". Which is why a mapping of THD to known background threads per node was desired, to avoid syscalls and thread migrations.
When you added the task in the spreadsheet, you were right to anticipate that it could require large cleanup of the InnoDB structure, but I am beginning to think it will be way more complicated.
I mentioned it on 2017-07-10 (http://marialog.archivist.info/2017-07-10.txt), too, and you said you'd give it a try.
Also, you once mentioned that most architectures have 2 to 4 NUMA nodes. No doubt, if we could implement support for the NUMA architecture such that InnoDB made the best use of it, there would be a performance difference, but it would still be quite negligible on a very fast computer,
Cross-node cache-coherency takes significant clock cycles compared to local access. What you'd end up with is a system that has idle time that can't be used. By finishing this task, this could be measured; local/remote access can be measured with perf (perhaps even with fake_numa). https://joemario.github.io/blog/2016/09/01/c2c-blog/
and I really don't think making such big changes to InnoDB would be worth the effort and risk.
There is no gain without risk. Small changes with careful review, running full tests between them, is the way large changes are made low-risk.
Trying to bring a major change to a working and verified code (written by someone else) is more error-prone than adding some modular code and functions to support a new feature. I hope you agree.
When developing you are always developing for current and future use. As such, architectural changes need to be commensurate with the functional changes. Jan and I have placed emphasis on writing test cases. These ensure that changes don't break existing behaviour.
Last but not least, if you think it can still be done and have an idea, I will be more than willing to attempt it. After all, GSoC has only got me started contributing to open source, and this evaluation won't be the end of it.
And since there's no other database software out there which supports NUMA (let's ignore numa-interleave) and since I was the first to start working on this
There is; I found https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c the other day. It wouldn't have helped much as it has a different architecture, and getting on top of one was hard enough.
, I will be really proud to see it through to the end.
This week it looks like you've done one commit that is largely a revert that Jan asked for. You have slowed down considerably.
Some evaluation stuff: I see only one text field in the form for a link to the work done, and it takes a single URL. Reserving this for the GitHub link, where do I add a link to the spreadsheet or documentation?
Assuming you have already verified the work done so far, I will be creating a pull request shortly, so I can add whether it was merged or not, in the evaluation.
Looks like the fake_numa stuff is incomplete. Pinning of background threads looks odd as they aren't big allocators of memory and aren't associated with SQL threads/buffer pool instance nodes. I have looked through it. As a pull request, 60 commits with 26(?) of them fixing prior commits will look messy for anyone trying to look at the changes. If you are up for trying to clean it up, please give it a go. Doing it in a copied repository is a good idea; as you've seen before, it can get a bit hard to keep track of.
Thanking You, Sumit
---------- Forwarded message ----------
From: Sumit Lakra
On 22/08/17 05:34, Sumit Lakra wrote:
Hello,
I went through the code that the InnoDB background threads use to pick tasks and execute them, in detail. This is how they seem to work.
The reader/writer threads are passed an integer which acts as a segment. They then call fil_aio_wait(segment), which calls os_aio_handler(segment, &node, &message, &type). The control then goes to os_aio_simulated_handler(segment, m1, m2, request), where the code gets more complicated with AIO arrays and slots. It gets harder to understand how they choose their tasks; it is definitely not a simple queue structure from which they pick them. Also, which buffer pool a task is related to can only be figured out quite late, based on the value of m2, which stores the address of a bpage.
A simple queue could easily have been replaced with multiple queues, i.e. a queue per NUMA node, like we once discussed on IRC.
Yes, you know where requests are created and where the job message ends up, so you should be able to create an implementation. Even if it doesn't preserve the existing behaviour it should be enough to test it.
I am sorry I am unable to create an implementation. It's not like I don't want to. I just can't completely understand how it has been implemented currently. I have tried to follow the code before, and I did again the whole day today, but I can't come up with an idea at all.
Lastly, all these procedures are common for log threads as well.
relevance?
The reader/writer/log threads all start by executing fil_aio_wait(segment), in srv0start.cc
Another thing, you mentioned more than once you wanted the reader
SQL threads? Or is this the same thing?
threads to look for a bpage in their local nodes before looking them up in other nodes, but they use the bpage structure itself like I mentioned. Obviously neither of us had a proper understanding of how InnoDB worked in these aspects when we started the project.
No project knows all the details.
These threads seem to operate on bpages mostly rather than buf_pools, which makes numa mapping even harder
You could add a node_id to the class buf_page_t?
Okay, but how do I use it? Where? In which functions?
(mapping buf_pools to numa nodes would comparatively be easier), but operating on bpages is definitely more efficient than on buf_pools and hence shouldn't be changed.
Then again, there were cases when the tasks assigned to the background threads were done by other server threads as well, especially in case of flushing, but I successfully restricted them to their nodes. If we were to create queues per node, which queues would these other threads work on?
Yes, which is why this issue was raised with you months ago.
In other words the way in which InnoDB was initially written and expanded has made it significantly difficult to be adapted for explicit NUMA support.
The split of buffer pool instances was in appreciation that there are hot locks caused by a large number of threads.
It just needed more thought than implementing a few thread bindings. I'm sorry you thought it was that easy.
The split of the buffer pool is not problematic for the NUMA implementation. It's the task queues for the background threads that are racking my brain.
It wouldn't just require restructuring and changing the
way these threads use parameters, pick tasks etc,
You've changed very few parameters so far.
https://github.com/MariaDB/server/compare/10.2...theGodlessLakra:numa_one
but may also require
reordering the sequence in which they are executed.
I can't think of one example of this.
For example, the background threads are created before the user threads, and trying to use the NUMA node from user THD later on would mean more system calls and probably thread migrations.
Huh, what? "may..require..(bad design)".
One way (a really bad design) would be to create background threads when a request is made by an SQL thread, and then bind each to a node depending on the request, i.e. the buffer pool instance to which the associated bpage belongs, which would mean more system calls and migrations.
Which is why a mapping of THD to known background threads per node was desired, to avoid syscalls and thread migrations.
The SQL threads were expected to be bound to numa nodes, which I did, but I can't think of how to map them to background threads. Can you think of a way to do this? Let me know how you think this can be done. I will implement it asap.
When you added the task in the spreadsheet, you were right to anticipate
that it could require large cleanup of the InnoDB structure, but I am beginning to think it will be way more complicated.
I mentioned it on 2017-07-10 (http://marialog.archivist.info/2017-07-10.txt), too, and you said you'd give it a try.
Also, you once mentioned that most architectures have 2 to 4 NUMA nodes. No doubt, if we could implement support for the NUMA architecture such that InnoDB made the best use of it, there would be a performance difference, but it would still be quite negligible on a very fast computer,
Cross-node cache-coherency takes significant clock cycles compared to local access. What you'd end up with is a system that has idle time that can't be used.
By finishing this task, this could be measured; local/remote access can be measured with perf (perhaps even with fake_numa). https://joemario.github.io/blog/2016/09/01/c2c-blog/
and I really don't think making such big changes to InnoDB would be worth the effort and risk.
There is no gain without risk. Small changes with careful review, running full tests between them, is the way large changes are made low-risk.
Agreed, but I honestly can't make head or tail of the changes that will be required to replace the present task queue with a queue-per-node structure :(
Trying to bring a major change to a working and verified code (written by someone else) is more error-prone than adding some modular code and functions to support a new feature. I hope you agree.
When developing you are always developing for current and future use. As such, architectural changes need to be commensurate with the functional changes. Jan and I have placed emphasis on writing test cases. These ensure that changes don't break existing behaviour.
Like I said, I am not unwilling to complete this task, but I can't even think of where to start. Since the two of you have greater experience with InnoDB, maybe you should give it a shot. You will definitely understand the present structure better than I do. And if you can come up with even a verbal solution that sounds like it could work, I will implement it asap.
Last but not least, if you think it can still be done and have an idea, I will be more than willing to attempt it. After all, GSoC has only got me started contributing to open source, and this evaluation won't be the end of it.
And since there's no other database software out there which supports NUMA (let's ignore numa-interleave) and since I was the first to start working on this
There is; I found https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c the other day. It wouldn't have helped much as it has a different architecture, and getting on top of one was hard enough.
, I will be really proud to see it through to the end.
This week it looks like you've done one commit that is largely a revert that Jan asked for. You have slowed down considerably.
You have a good reason to be upset with me and I am not complaining. But although it doesn't look like much work on GitHub, since there are no changes, I spent a good deal of time brainstorming on the implementation of the first task from the spreadsheet under the "should be done" title, unfortunately without success.
Some evaluation stuff: I see only one text field in the form for a link to the work done, and it takes a single URL. Reserving this for the GitHub link, where do I add a link to the spreadsheet or documentation?
Assuming you have already verified the work done so far, I will be creating a pull request shortly, so I can add whether it was merged or not, in the evaluation.
Looks like the fake_numa stuff is incomplete. Pinning of background threads looks odd as they aren't big allocators of memory and aren't associated with SQL threads/buffer pool instance nodes.
Fixed the fake numa test which was failing in non-debug builds. Most of the memory allocation is done during buffer pool creation, and this has been managed properly while creating buffer pools per numa node. What other memory allocations do you wish to be managed? And how? (Non-pinned threads should probably be allowed to allocate memory from any node.)
I have looked through it. As a pull request, 60 commits with 26(?) of them fixing prior commits will look messy for anyone trying to look at the changes. If you are up for trying to clean it up, please give it a go. Doing it in a copied repository is a good idea; as you've seen before, it can get a bit hard to keep track of.
I spent all day trying to think of any way to implement a queue-per-node
thing, and I confess I failed to come up with one. So, instead of spending
more time on that, I will work on migrating these commits to a new branch
today. And I also urge you to give this task a try yourselves.
Thanks,
Sumit
---------- Forwarded message ----------
From: Sumit Lakra
Hello,
I am currently working on copying the changes from numa_one branch to a different branch in a more orderly fashion.
This is looking good: https://github.com/theGodlessLakra/server/tree/NUMA_Support Don't be afraid to be very descriptive in the commit message especially about the problem being solved and what functionality it implements/changes. Don't re-say what the code is doing, just cover its meaning at a high level.
I haven't been able to implement the queue per node structure for background threads, and I am afraid I won't be able to do it without your help. It is beyond me.
I gave it a few attempts and I have failed. It's not the implementation part but the 'coming up with a way to implement it' part that I haven't been able to figure out yet. I need you to know that I am trying my best here.
What plans you had and how they failed would have been good to know.
Back in http://marialog.archivist.info/2017-07-10.txt, you also said "warning the whole io read/write thread segements and scheduling is really messy and could do with a significant rework",
To do it properly yes.
and you also mentioned that you and Jan would have an attempt at it yourselves if I was unable to do it and moved on. Well, I regret moving on to SQL threads back then without informing you that I wasn't successful.
I probably should have harassed you a bit too. Acknowledging failures is an important part of learning, even if you just acknowledge them to yourself. How did SQL threads end up?
A 'queue-per-numa-node' is easier said than done. But anyway I kindly request you two to try it out yourselves now. You don't have to do the work. I just urge you to take some time out this weekend and go through the code. Let me know how you think it may be possible to implement it. I will do it.
Having only just read this, this is a bit late notice. Understanding the interoperation between parts is also a fundamental part of forming a plan to implement it.
I have 4 days 2 hrs before the deadline to submit the evaluation ends. If you can come up with an implementation plan within the next 2 or 2.5 days, I assure you I will code the implementation within a day of that, and probably commit it before submitting the final evaluation as well. I am willing to take this risk.
I thought your submission of a URL left it open to still committing on it. Please don't risk missing it. Making sure the existing code/comments/documentation is as clean as possible is better than tackling new work now.
However, if you are unable to come up with an implementation plan, we don't really have to hurry and stress ourselves. We can work together (I can't do this on my own) on this particular task later on and come up with something, assuming I am not failed in the final evaluation.
Happy Weekend, Sumit