September 2017
- 12 participants
- 25 discussions
In ha_innobase::info_low() there is the following dirty hack:
/*
The MySQL optimizer seems to assume in a left join that n_rows
is an accurate estimate if it is zero. Of course, it is not,
since we do not have any locks on the rows yet at this phase.
Since SHOW TABLE STATUS seems to call this function with the
HA_STATUS_TIME flag set, while the left join optimizer does not
set that flag, we add one to a zero value if the flag is not
set. That way SHOW TABLE STATUS will show the best estimate,
while the optimizer never sees the table empty. */
if (n_rows == 0 && !(flag & HA_STATUS_TIME)) {
        n_rows++;
}
It is very old (from 5.0 or earlier) and bug-prone, because
ha_innobase::open() calls:
info(HA_STATUS_NO_LOCK | HA_STATUS_VARIABLE | HA_STATUS_CONST);
so every opened empty table will appear non-empty! I don't know what the
problem with the join optimizer is, but having the storage engine handle it
does not seem like the right thing to do. Moreover, relying on
HA_STATUS_TIME for this is definitely wrong. We could make the join
optimizer ignore the "0 rows" case for all storage engines.
Is the "1 row" case a big win anyway? Or we could add a new flag HA_JOIN_STAT
and use it in make_join_statistics().
Re: [Maria-developers] A problem with implementing Group Commit with Binlog with MyRocks
by Sergey Petrunia 11 Sep '17
Hi Kristian,
On Mon, Sep 11, 2017 at 01:02:31PM +0200, Kristian Nielsen wrote:
> Sergey Petrunia <sergey(a)mariadb.com> writes:
>
> > This is about https://jira.mariadb.org/browse/MDEV-11934. I've encountered
> > an interesting issue here, so I thought I would consult on both MyRocks and
> > MariaDB lists.
>
> Is your current code available somewhere?
>
No, it wasn't. Just pushed the current code into bb-10.2-mdev11934 branch:
https://github.com/MariaDB/server/tree/bb-10.2-mdev11934
All changes are in the last commit.
BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
The dates for the Developers Unconference in Shenzhen, China, are coming
up fast.
This is our first official event in Asia, and it will be held from 13
November to 17 November.
13 November - new contributor day
14-15 November - traditional developers unconference
16-17 November - patch days
As always, it's free to attend, and open to anyone interested in MariaDB
development.
Details are at
https://mariadb.org/2017-2-developers-unconference-and-related-events-shenz…
and sign up on the Meetup page:
https://www.meetup.com/MariaDB-Developers/events/241873269/
Hello,
I am not aware of what information regarding this project has been
exchanged between my mentor Daniel Black and others including serg, so I am
sharing this here for everyone to see. My last mail contains the
conversations I had with my mentors.
Some quotes from the final evaluation -
"Your questions in the last week have been exceptionally keen even though
these seem to be the same questions from 1-2 months ago"
Remind me when you (Daniel Black) answered them.
"Sumit, overall we are happy with the quality of work. Unfortunately the
quantity of work showed a lack of putting in hours and limited test cases
fell short of the project plan and organisational expectations"
What limited test cases? I wrote all the test cases you (Daniel Black) and
Jan Lindstrom asked me to write. Which test case did I miss? I even ran a
benchmarking test with all possible values and sent you the results.
However, testing the actual performance difference is something that can
only be done on a real NUMA machine. Sorry, I don't have one. And have you
ever even tested my code on a real NUMA machine? I too deserve to know if
my work has made any real difference on the machine it is supposed to work
on. And I have asked you to run it on a real machine more than once over
the past months.
You are happy with the quality of the work. Good. You are unhappy with the
quantity. How do you expect me to increase the quantity when I have already
completed the tasks I promised in my GSoC proposal, except for those that
you removed from the list? What project plan are you referring to?
My GSoC proposal -
https://docs.google.com/document/d/1UbOQysgOOzCM7z5FPC7gmuCpyxPmB7RGg7y9Ogq…
Tasks spreadsheet -
https://docs.google.com/spreadsheets/d/1nE-qFXhwwhF0hpOX3wUlczPjlPZTK_UzBu_…
The tasks can be categorised as follows.
[1] The tasks mentioned in my GSoC proposal which were not removed or
changed later. - I completed all these.
[2] The tasks mentioned in my GSoC proposal which were crossed out (under
consideration to be dropped / unnecessary) or were implemented
differently. - I either did not complete these, for obvious reasons, or
implemented some of them differently.
[3] The tasks you added to the spreadsheet in the last month, originally
under the title 'Out of Time'. - I completed some of these too.
Then, when these were done, you moved a task (the queue-per-node task) from
the 'Out of Time' group to the third-month group. We both agreed it was an
important part of the completion of the project, and I tried my best to
implement it. I spent days trying to figure it out, asked you for help,
didn't get any, tried again and yet failed, because I am no expert on either
the InnoDB or the SQL side of MariaDB.
Which organisational expectations did you ask me to fulfil? When?
I don't remember ever being told about any organisational expectations.
Perhaps these were your personal expectations of having complete NUMA
support in MariaDB by the end of GSoC. If so, then I am really sorry I
wasn't able to fulfil those expectations, because I got stuck at an
important part of the project which was also a seemingly impossible task to
do, and *YOU FAILED TO HELP ME* with it even though I repeatedly asked
(begged, actually) you for help. You just kept reminding me that doing the
task was my job.
That's some real good mentoring right there.
Another quote from the final evaluation -
"Passing to get your continued work isn't honoring the rules of the project
or being fair to Google"
Yeah, how about showing some good mentorship by helping me to complete the
project while we had the time to do it within Google's timelines instead.
Oh! Wait a sec... you never went through the code... until 3-4 days ago. I
was determined to complete it if you had helped me. I have been mentioning
it in all mails since the mail I wrote to you and Jan on:
22nd Aug - "Last but not least, if you think it can still be done and have
an idea, I will be more than willing to attempt it. After all, GSoC has only
got me started to contribute to open-source and this evaluation won't be
the end of it."
25th Aug - "I am sorry I am unable to create an implementation. It's not
like I don't want to. I just can't completely understand how it has been
implemented currently. I have tried to follow the code before, and I did
again the whole day today, but I can't come up with an idea at all."
25th Aug - "Like I said, I am not unwilling to complete this task. But I
can't even think of where to start. Since, the two of you have greater
experience with InnoDB, maybe you should give it a shot. You will
definitely understand the present structure better than I do. And if you
can come up with even a verbal solution that sounds like it could work, I
will implement it asap."
25th Aug - "I spent all day trying to think of any way to implement a
queue-per-node thing, and I confess I failed to come up with one. So,
instead of spending more time on that, I will work on migrating these
commits to a new branch today. And I also urge you to give this task a try
yourselves."
25th Aug - " I haven't been able to implement the queue per node structure
for background threads, and I am afraid I won't be able to do it without
your help. It is beyond me. I gave it a few attempts and I have failed.
It's not the implementation part but the 'coming up with a way to implement
it' part that I haven't been able to figure out yet. I need you to know
that I am trying my best here."
25th Aug - "But anyway I kindly request you two to try it out yourselves
now. You don't have to do the work. I just urge you to take some time out
this weekend and go through the code. Let me know how you think it may be
possible to implement it. I will do it. I have 4 days and 2 hrs before the
deadline to submit the evaluation ends. If you can come up with an
implementation plan within the next 2 or 2.5 days, I assure you I will code
the implementation within a day of that and probably commit it before
submitting the final evaluation as well. I am willing to take this risk."
On Aug 28th you wrote a mail and asked me "Knowing what plans you had and
how they failed would have been good to know." So, on the same day I wrote
you a mail with a detailed explanation of the attempts I had made in trying
to complete the task, asking you again to try it yourself and help me if
you could come up with an idea. You never did.
31st Aug http://marialog.archivist.info/2017-08-31.txt
Me - "did you have an attempt at the task ? come up with any ideas yet ?"
dragonheart - been a bit busy with my own work so far. I know I'm going to
have to look at a lot of stuff to finish the final GSoC evaluation
4th Sept http://marialog.archivist.info/2017-09-04.txt
Me - "You must have seen my mails. Did you have an attempt at the queue per
node thing ? Any ideas ?"
dragonheart - well, implementing it was your task. but no, haven't found
the hours to look though the code again.
As per our conversation on IRC, Daniel Black clearly hadn't even gone
through the code by 4th Sept or attempted to solve the problem. How did he
hope to complete the final evaluation? Even if he went through the code a
day before submitting his evaluation, isn't this too late for a mentor to
go through his/her student's code?
Here is a list of some questions I would like MariaDB staff, especially
serg, to ponder.
[1] Was the task doable at all?
[A] Not unless you are an expert on both the SQL side and the InnoDB side,
expert enough to re-write a major part of the interface between them. Jan
himself acknowledged he wouldn't be able to help me with this, because he
has no experience with the interface. I really appreciate that.
[2] Even if the task was doable by a team of SQL and InnoDB experts, was it
doable by an individual student working on MariaDB for the past four
months?
[A] Nope.
[3] Why didn't Daniel Black help me with the task, even when I clearly
mentioned weeks ago that I would need help with it?
[A] ???
[4] Did my mentor Daniel Black even try to help me with this particular
task? Ever?
[A] Nope.
[5] Did Daniel Black evaluate me on the basis of the number of commits or
the tasks accomplished by those commits?
[A] Based on the evaluation, he has clearly been unhappy with the decreased
number of commits, not the content, as all the tasks mentioned in the
original proposal have been achieved, along with a few more.
[6] Can Daniel Black himself complete this task, and thus the project?
[A] I bet not.
No doubt the project is incomplete, because it involves doing a task that
is nearly impossible to do, at least for me, especially without any help.
But does GSoC mean that the student necessarily has to complete the entire
project? I have gone through the code submission guidelines plenty of
times. They clearly state that we must say which tasks were done and
which tasks remain. Obviously not all projects are complete; some students
only complete the project partially, but mention everything and continue
working on it. I did the same, and mentioned in my evaluation that almost
all tasks were complete, instead of all, and I have always been willing to
complete this project. But man, I can't do something impossible, especially
when the people I am supposed to ask for help themselves fail to help
their students.
Thanks,
Sumit
Hello Serg,
I am sorry to bug you with this mail, but I am not sure what discussions
you and my mentors had before I was failed in GSoC. I completed all the
tasks I had mentioned in my proposal, but the project is incomplete. No
doubt about that, but it is so because I got stuck with a major task
(added in the last month) of the project, for which I received no help from
my mentors. These are the conversations I had with my mentors recently
regarding the same task.
Thanks,
Sumit
---------- Forwarded message ----------
From: Sumit Lakra <sumitlakradev(a)gmail.com>
Date: Tue, Aug 22, 2017 at 1:04 AM
Subject: GSoC Final Evaluation
To: Daniel Black <daniel.black(a)au1.ibm.com>, Jan Lindström <
jan.lindstrom(a)mariadb.com>
Hello,
I went through, in detail, the code that the InnoDB background threads use
to pick tasks and execute them. This is how they seem to work.
The reader/writer threads are passed an integer which acts as a segment.
They then call fil_aio_wait(segment), which calls os_aio_handler(segment,
&node, &message, &type). The control then goes to
os_aio_simulated_handler(segment, m1, m2, request), where the code gets
more complicated with AIO arrays and slots. It gets harder to understand
how they choose their tasks. It is definitely not a simple queue structure
from which they pick their tasks. Also, which buffer pool the task is
related to can only be figured out quite late, based on the value of m2,
which stores the address of a bpage. A simple queue could easily have been
replaced with multiple queues, i.e. a queue per NUMA node like we had once
discussed on IRC. Lastly, all these procedures are common for the log
threads as well.
Another thing: you mentioned more than once that you wanted the reader
threads to look for a bpage in their local nodes before looking it up in
other nodes, but they use the bpage structure itself, like I mentioned.
Obviously neither of us had a proper understanding of how InnoDB worked in
these aspects when we started the project. These threads seem to operate
mostly on bpages rather than buf_pools, which makes NUMA mapping even
harder (mapping buf_pools to NUMA nodes would comparatively be easier), but
it is definitely more efficient than buf_pools and hence shouldn't be
changed.
Then again, there were cases when the tasks assigned to the background
threads were done by other server threads as well, especially in the case
of flushing, but I successfully restricted them to their nodes. If we were
to create queues per node, which queues would these other threads work on?
In other words, the way in which InnoDB was initially written and expanded
has made it significantly difficult to adapt for explicit NUMA support. It
wouldn't just require restructuring and changing the way these threads use
parameters, pick tasks etc., but may also require re-ordering the order in
which they are executed. For example, the background threads are created
before the user threads, and trying to use the NUMA node from the user THD
later on would mean more system calls and probably thread migrations.
When you added the task to the spreadsheet, you were right to anticipate
that it could require a large cleanup of the InnoDB structure, but I am
beginning to think it will be way more complicated. Also, you once
mentioned that most architectures have 2 to 4 NUMA nodes. No doubt, if we
could implement support for the NUMA architecture such that InnoDB would
make the best use of it, there would be a performance difference, but it
would still be quite negligible on a very fast computer, and I really don't
think making such big changes to InnoDB would be worth the effort and risk.
Trying to bring a major change to working and verified code (written by
someone else) is more error-prone than adding some modular code and
functions to support a new feature. I hope you agree.
Last but not least, if you think it can still be done and have an idea, I
will be more than willing to attempt it. After all, GSoC has only got me
started to contribute to open-source and this evaluation won't be the end
of it. And since there's no other database software out there which
supports NUMA (let's ignore numa-interleave) and since I was the first to
start working on this, I will be really proud to see it through to the end.
Some evaluation stuff:
I see only one text field in the form for a link to the work done, and it
takes a single URL. Reserving this for the GitHub link, where do I add a
link to the spreadsheet or documentation?
Assuming you have already verified the work done so far, I will be
creating a pull request shortly, so I can note in the evaluation whether
it was merged or not.
Thanking You,
Sumit
---------- Forwarded message ----------
From: Daniel Black <daniel.black(a)au1.ibm.com>
Date: Thu, Aug 24, 2017 at 12:11 PM
Subject: Re: GSoC Final Evaluation
To: Sumit Lakra <sumitlakradev(a)gmail.com>, Jan Lindström <
jan.lindstrom(a)mariadb.com>
On 22/08/17 05:34, Sumit Lakra wrote:
> Hello,
>
> I went through the code that the InnoDB background threads use to pick
> tasks and execute them, in detail. This is how they seem to work.
>
> The reader/writer threads are passed an integer which acts as a segment.
> They then call fil_aio_wait(segment) which calls os_aio_handler(segment,
> &node, &message, &type). The control then goes to the
> os_aio_simulated_handler(segment, m1, m2, request) where the code gets
> more complicated with AIO arrays and slots. It gets harder to understand
> how they choose their tasks. It is definitely not a simple queue
> structure from which they pick their tasks. Also, which buffer pool the
> task is related to can only be figured out quite later.. based on the
> value of m2, which stores the address of a bpage.
> A simple queue could
> have been easily replaced with multiple queues, i.e. a queue per
> numa-node like we had once discussed on IRC.
Yes, you know where requests are created and where the job message ends
up, so you should be able to create an implementation. Even if it doesn't
preserve the existing behaviour, it should be enough to test it.
> Lastly, all these
> procedures are common for log threads as well.
relevance?
> Another thing, you mentioned more than once you wanted the reader
SQL threads? Or is this the same thing?
> threads to look for a bpage in their local nodes before looking them up
> in other nodes, but they use the bpage structure itself like I
> mentioned
> Obviously neither of us had a proper understanding of how
> InnoDB worked in these aspects when we started the project.
No project knows all the details.
> These
> threads seem to operate on bpages mostly rather than buf_pools, which
> makes numa mapping even harder
You could add a node_id to the class buf_page_t?
> (buf_pools to numa nodes would
> comparatively be easier), but is definitely more efficient than
> buf_pools and hence shouldn't be changed.
> Then again there were cases when the tasks assigned to the background
> threads were done by other server threads as well, especially in case of
> flushing, but I successfully restricted them to their nodes. If we were
> to create queues per node, which queues would these other threads work on.
Yes, which is why this issue was raised with you months ago.
> In other words the way in which InnoDB was initially written and
> expanded has made it significantly difficult to be adapted for explicit
> NUMA support.
The split of buffer pool instances was in recognition that there are hot
locks caused by a large number of threads.
It just needed more thought than implementing a few thread bindings. I'm
sorry you thought it was that easy.
> It wouldn't just require restructuring and changing the
> way these threads use parameters, pick tasks etc,
You've changed very few parameters so far.
https://github.com/MariaDB/server/compare/10.2...theGodlessLakra:numa_one
> but may also require
> re-ordering the order in which they are executed.
I can't think of one example of this.
> For example, the
> background threads are created before the user threads, and trying to
> use the NUMA node from user THD later on would mean more system calls
> and probably thread migrations.
Huh, what? "may..require..(bad design)".
Which is why a mapping of THD to known background threads per node was
desired, to avoid syscall thread migrations.
> When you added the task in the spreadsheet, you were right to anticipate
> that it could require large cleanup of the InnoDB structure, but I am
> beginning to think it will be way more complicated.
I mentioned it on 2017-07-10
(http://marialog.archivist.info/2017-07-10.txt) too, and you said you'd
give it a try.
> Also, as you once
> mentioned that most architectures have 2 to 4 NUMA nodes. No doubt, if
> we could implement a support for NUMA architectuure such that InnoDB
> would make the best use out of it, there would be a performance
> difference but it would still be quite negligible in a very fast
> computer,
Cross-node cache coherency takes significant clock cycles compared to
local access. What you'd end up with is a system that has idle time that
can't be used.
By finishing this task, this could be measured. Local/remote access
performance can be measured (perhaps even with fake_numa).
https://joemario.github.io/blog/2016/09/01/c2c-blog/
> and I really don't think making such big changes to InnoDB
> would be worth the effort and risk.
There is no gain without risk. Small changes with careful review, running
full tests between them, is the way large changes are made low-risk.
> Trying to bring a major change to a
> working and verified code (written by someone else) is more error-prone
> than adding some modular code and functions to support a new feature. I
> hope you agree.
When developing, you are always developing for current and future use. As
such, architectural changes need to be commensurate with the functional
changes. Jan and I have placed emphasis on writing test cases. These
ensure that changes don't break existing behaviour.
> Last but not least, if you think it can still be done and have an idea, I
> will be more than willing to attempt it. After all, GSoC has only got me
> started to contribute to open-source and this evaluation won't be the
> end of it.
> And since there's no other database software out there which
> supports NUMA (let's ignore numa-interleave) and since I was the first
> to start working on this
There is; I found
https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c
the other day. It wouldn't have helped much, as it has a different
architecture, and getting on top of one was hard enough.
>, I will be really proud to see it through to
> the end.
This week looks like you've done one commit that is largely a revert
that Jan asked for. You have slowed down considerably.
> Some Evaluation stuff :
> I see only one text field in the form for a link to the work done, and
> it takes a single url. Reserving this for the github link, where do I
> add a link to the spreadsheet or documentation ?
>
> Assuming you have already verified the work done so far, I will be
> creating a pull request shortly, so I can add whether it was merged or
> not, in the evaluation.
Looks like the fake_numa stuff is incomplete. Pinning of background
threads looks odd, as they aren't big allocators of memory and aren't
associated with SQL threads/buffer pool instance nodes.
I have looked through it. As a pull request, 60 commits with 26(?) of them
fixing prior commits will look messy for anyone trying to look at the
changes. If you are up for trying to clean it up, please give it a go.
Doing it in a copied repository is a good idea, as you've seen before it
can get a bit hard to keep track of.
> Thanking You,
> Sumit
>
---------- Forwarded message ----------
From: Sumit Lakra <sumitlakradev(a)gmail.com>
Date: Fri, Aug 25, 2017 at 3:03 AM
Subject: Re: GSoC Final Evaluation
To: Daniel Black <daniel.black(a)au1.ibm.com>
Cc: Jan Lindström <jan.lindstrom(a)mariadb.com>
On Thu, Aug 24, 2017 at 12:11 PM, Daniel Black <daniel.black(a)au1.ibm.com>
wrote:
>
>
> On 22/08/17 05:34, Sumit Lakra wrote:
> > Hello,
> >
> > I went through the code that the InnoDB background threads use to pick
> > tasks and execute them, in detail. This is how they seem to work.
> >
> > The reader/writer threads are passed an integer which acts as a segment.
> > They then call fil_aio_wait(segment) which calls os_aio_handler(segment,
> > &node, &message, &type). The control then goes to the
> > os_aio_simulated_handler(segment, m1, m2, request) where the code gets
> > more complicated with AIO arrays and slots. It gets harder to understand
> > how they choose their tasks. It is definitely not a simple queue
> > structure from which they pick their tasks. Also, which buffer pool the
> > task is related to can only be figured out quite later.. based on the
> > value of m2, which stores the address of a bpage.
>
> > A simple queue could
> > have been easily replaced with multiple queues, i.e. a queue per
> > numa-node like we had once discussed on IRC.
>
> Yes, you know where requests are created and where the job message ends
> up so you should be able to create an implementation. Even if it doesn't
> preserve the existing behaviour, it should be enough to test it.
>
I am sorry I am unable to create an implementation. It's not like I don't
want to. I just can't completely understand how it has been implemented
currently. I have tried to follow the code before, and I did again the
whole day today, but I can't come up with an idea at all.
>
> > Lastly, all these
> > procedures are common for log threads as well.
>
> relevance?
The reader/writer/log threads all start by executing fil_aio_wait(segment)
in srv0start.cc.
>
> > Another thing, you mentioned more than once you wanted the reader
>
> SQL threads? Or is this the same thing?
>
> > threads to look for a bpage in their local nodes before looking them up
> > in other nodes, but they use the bpage structure itself like I
> > mentioned
> > Obviously neither of us had a proper understanding of how
> > InnoDB worked in these aspects when we started the project.
>
> No project knows all the details.
>
> > These
> > threads seem to operate on bpages mostly rather than buf_pools, which
> > makes numa mapping even harder
>
> You could add a node_id to the class buf_page_t?
>
Okay, but how do I use it? Where? In which functions?
>
> > (buf_pools to numa nodes would
> > comparatively be easier), but is definitely more efficient than
> > buf_pools and hence shouldn't be changed.
>
>
> > Then again there were cases when the tasks assigned to the background
> > threads were done by other server threads as well, especially in case of
> > flushing, but I successfully restricted them to their nodes. If we were
> > to create queues per node, which queues would these other threads work
> on.
>
> Yes, which is why this issue was raised with you months ago.
>
> > In other words the way in which InnoDB was initially written and
> > expanded has made it significantly difficult to be adapted for explicit
> > NUMA support.
>
> The split of buffer pool instances was in recognition that there are
> hot locks caused by a large number of threads.
>
> It just needed more thought than implementing a few thread bindings. I'm
> sorry you thought it was that easy.
>
The split of the buffer pool is not problematic for the NUMA
implementation. It's the task queues for the background threads that are
wrecking my brain.
>
> > It wouldn't just require restructuring and changing the
> > way these threads use parameters, pick tasks etc,
>
>
> You've changed very few parameters so far.
>
> https://github.com/MariaDB/server/compare/10.2...theGodlessLakra:numa_one
>
> > but may also require
> > re-ordering the order in which they are executed.
>
> I can't think of one example of this.
> > For example, the
> > background threads are created before the user threads, and trying to
> > use the NUMA node from user THD later on would mean more system calls
> > and probably thread migrations.
>
> Huh, what? "may..require..(bad design)".
>
One way (a really bad design) would be to create background threads when a
request is made by an SQL thread, and then bind each one to a node
depending on the request, i.e. the buffer pool instance to which the
associated bpage belongs, which would mean more system calls and
migrations. Obviously a very bad design.
>
> Which is why a mapping of THD to known background threads per node was
> desired, to avoid syscall thread migrations.
>
The SQL threads were expected to be bound to NUMA nodes, which I did, but I
can't think of how to map them to background threads. Can you think of a
way to do this? Let me know how you think this can be done. I will
implement it asap.
> > When you added the task in the spreadsheet, you were right to anticipate
> > that it could require large cleanup of the InnoDB structure, but I am
> > beginning to think it will be way more complicated.
>
> I mentioned it on 2017-07-10
> (http://marialog.archivist.info/2017-07-10.txt) too, and you said you'd
> give it a try.
>
> > Also, as you once
> > mentioned that most architectures have 2 to 4 NUMA nodes. No doubt, if
> > we could implement a support for NUMA architectuure such that InnoDB
> > would make the best use out of it, there would be a performance
> > difference but it would still be quite negligible in a very fast
> > computer,
>
> Cross node cache-coherency takes significant clock cycles compared to
> local access. What you'd end up with is a system that has idle time that
> can't be used.
>
> By finishing this task this could be measured. local/remote perf access
> can be measured (perhaps even with fake_numa).
> https://joemario.github.io/blog/2016/09/01/c2c-blog/
>
> > and I really don't think making such big changes to InnoDB
> > would be worth the effort and risk.
>
> There is no gain without risk. Small changes with careful review,
> running full tests between them, is the way large changes are made low
> risk.
>
Agreed, but I honestly can't make head or tail of the changes that will be
required to replace the present task queue with a queue-per-node
structure :(
>
> > Trying to bring a major change to a
> > working and verified code (written by someone else) is more error-prone
> > than adding some modular code and functions to support a new feature. I
> > hope you agree.
>
> When developing you are always developing for current and future use. As
> such, architectural changes need to be commensurate with the functional
> changes. Jan and I have placed emphasis on writing test cases. These
> ensure that changes are verified.
>
Like I said, I am not unwilling to complete this task. But I can't even
think of where to start. Since the two of you have greater experience with
InnoDB, maybe you should give it a shot. You will definitely understand the
present structure better than I do. And if you can come up with even a
verbal solution that sounds like it could work, I will implement it asap.
>
> > Last but not least, if you think it can still be done and have an idea, I
> > will be more than willing to attempt it. After all, GSoC has only got me
> > started to contribute to open-source and this evaluation won't be the
> > end of it.
>
>
> > And since there's no other database software out there which
> > supports NUMA (let's ignore numa-interleave) and since I was the first
> > to start working on this
>
> There is. I found
> https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c
> the other day. It wouldn't have helped much as it has a different
> architecture, and getting on top of one was hard enough.
>
> >, I will be really proud to see it through to
> > the end.
>
> This week looks like you've done one commit that is largely a revert
> that Jan asked for. You have slowed down considerably.
>
You have a good reason to be upset with me and I am not complaining. But
although it doesn't look like much work on GitHub, since there are no
changes, I spent a good deal of time brainstorming the implementation of
the first task from the spreadsheet under the "should be done" heading,
unfortunately without success.
>
> > Some evaluation stuff:
> > I see only one text field in the form for a link to the work done, and
> > it takes a single URL. Reserving this for the GitHub link, where do I
> > add a link to the spreadsheet or documentation?
> >
> > Assuming you have already verified the work done so far, I will be
> > creating a pull request shortly, so I can add whether it was merged or
> > not, in the evaluation.
>
> Looks like the fake_numa stuff is incomplete. Pinning of background
> threads looks odd as they aren't big allocators of memory and aren't
> associated with SQL threads/buffer pool instance nodes.
>
Fixed the fake_numa test which was failing in non-debug builds. Most of the
memory allocation is done during buffer pool creation, and these
allocations have been managed properly while creating buffer pools per NUMA
node. What other memory allocations do you want managed, and how?
(Non-pinned threads should probably be allowed to allocate memory from any
node.)
>
> I have looked through it. As a pull request, 60 commits with 26(?) of them
> fixing prior commits will look messy for anyone trying to look at the
> changes. If you are up for trying to clean it up, please give it a go.
> Doing it in a copied repository is a good idea, as you've seen before it
> can get a bit hard to keep track of.
>
I spent all day trying to think of a way to implement a queue-per-node
structure, and I confess I failed to come up with one. So, instead of
spending more time on that, I will work on migrating these commits to a new
branch today. I also urge you to give this task a try yourselves.
Thanks,
Sumit
---------- Forwarded message ----------
From: Sumit Lakra <sumitlakradev(a)gmail.com>
Date: Fri, Aug 25, 2017 at 6:52 PM
Subject: Re: GSoC Final Evaluation
To: Daniel Black <daniel.black(a)au1.ibm.com>
Cc: Jan Lindström <jan.lindstrom(a)mariadb.com>
Hello,
I am currently working on copying the changes from numa_one branch to a
different branch in a more orderly fashion. I haven't been able to
implement the queue per node structure for background threads, and I am
afraid I won't be able to do it without your help. It is beyond me. I gave
it a few attempts and I have failed. It's not the implementation part but
the 'coming up with a way to implement it' part that I haven't been able to
figure out yet. I need you to know that I am trying my best here.
Back in http://marialog.archivist.info/2017-07-10.txt, you also said
"warning the whole io read/write thread segments and scheduling is really
messy and could do with a significant rework", and you also mentioned that
you and Jan would have an attempt at it yourselves if I am unable to do it
and move on. Well, I regret moving on to SQL threads back then without
informing you that I wasn't successful. A 'queue-per-numa-node' is easier
said than done. But anyway, I kindly request you two to try it out
yourselves now. You don't have to do the work. I just urge you to take some
time out this weekend and go through the code. Let me know how you think it
may be possible to implement it. I will do it.
I have 4 days and 2 hours before the deadline to submit the evaluation
ends. If you can come up with an implementation plan within the next 2 or
2.5 days, I assure you I will code the implementation within a day of that
and probably commit it before submitting the final evaluation as well. I am
willing to take this risk.
However, if you are unable to come up with an implementation plan, we don't
really have to hurry and stress ourselves. We can work together (I can't do
this on my own) on this particular task later on and come up with
something, assuming I am not failed in the final evaluation.
Happy Weekend,
Sumit
---------- Forwarded message ----------
From: Daniel Black <daniel.black(a)au1.ibm.com>
Date: Mon, Aug 28, 2017 at 7:56 AM
Subject: Re: GSoC Final Evaluation
To: Sumit Lakra <sumitlakradev(a)gmail.com>
Cc: Jan Lindström <jan.lindstrom(a)mariadb.com>
On 25/08/17 23:22, Sumit Lakra wrote:
> Hello,
>
> I am currently working on copying the changes from numa_one branch to a
> different branch in a more orderly fashion.
This is looking good:
https://github.com/theGodlessLakra/server/tree/NUMA_Support
Don't be afraid to be very descriptive in the commit message especially
about the problem being solved and what functionality it
implements/changes. Don't re-say what the code is doing, just cover its
meaning at a high level.
> I haven't been able to
> implement the queue per node structure for background threads, and I am
> afraid I won't be able to do it without your help. It is beyond me.
> I
> gave it a few attempts and I have failed. It's not the implementation
> part but the 'coming up with a way to implement it' part that I haven't
> been able to figure out yet. I need you to know that I am trying my best
> here.
It would have been good to know what plans you had and how they failed.
> Back in http://marialog.archivist.info/2017-07-10.txt,
> you also said "warning the whole io read/write thread segments and
> scheduling is really messy and could do with a significant rework",
To do it properly yes.
> and
> you also mentioned that you and Jan would have an attempt at it
> yourselves, if I am unable to do it and I move on. Well, I regret moving
> on to SQL threads back then without informing you that I wasn't
> successful.
I probably should have harassed you a bit too. Acknowledging failures is
an important part of learning, even if you just acknowledge them to
yourself.
How did SQL threads end up?
> A 'queue-per-numa-node' is easier said than done. But
> anyway I kindly request you two to try it out yourselves now. You don't
> have to do the work. I just urge you to take some time out this weekend
> and go through the code. Let me know how you think it may be possible to
> implement it. I will do it.
Having only just read this, it is a bit late notice. Understanding the
interoperation between the parts is also a fundamental part of coming up
with a plan to implement it.
> I have 4 days and 2 hours before the deadline to submit the evaluation ends.
> If you can come up with an implementation plan within the next 2 or 2.5
> days, I assure you I will code the implementation within a day of that
> and probably commit it before submitting the final evaluation as well. I
> am willing to take this risk.
I thought your submission of a URL left it open to still committing on
it. Please don't risk missing it. Making sure the existing
code/comments/documentation is as clean as possible is better than
tackling new work now.
> However, if you are unable to come up with an implementation plan, we
> don't really have to hurry and stress ourselves. We can work together (I
> can't do this on my own) on this particular task later on and come up
> with something, assuming I am not failed in the final evaluation.
>
> Happy Weekend,
> Sumit
---------- Forwarded message ----------
From: Sumit Lakra <sumitlakradev(a)gmail.com>
Date: Mon, Aug 28, 2017 at 9:50 PM
Subject: Re: GSoC Final Evaluation
To: Daniel Black <daniel.black(a)au1.ibm.com>
Cc: Jan Lindström <jan.lindstrom(a)mariadb.com>
Hello,
The following is a detailed explanation of what I did, how and why, while
trying to come up with a solution to the task queue problem. However,
before you go through this I would still request you to have an attempt at
it yourselves, so your perspectives won't be affected by my ideas and you
may come up with something different.
I had once tried this before in
https://github.com/MariaDB/server/commit/4b8436d3f7c573668a0d37b54dab2d05e8e102be.
I even pointed this commit out once on IRC (I can't access logs of this
time). Anyway, I didn't put much focus on this as I was onto some other
task.
Before trying to replace the present queue structure with a queue-per-node
one, I first tried to let all the threads access this queue as before, but
only pick tasks dealing with their local nodes. Neither this queue nor
these tasks seem to be easily distinguishable in code. The threads start by
running fil_aio_wait(segment). This function calls os_aio_handler(segment,
&node, &message, &type), which calls os_aio_simulated_handler(segment, m1,
m2, request). That in turn calls AIO::get_array_and_local_segment(&array,
global_segment), creates a SimulatedAIOHandler handler(array, segment), and
then calls handler.check_pending(global_segment, event),
handler.init(n_slots), and slot = handler.check_completed(&n_reserved).
Now, in order to skip a task meant for another thread, a thread must pick
the task and check whether it relates to its own node or not. If it does,
the thread executes it; otherwise it should leave it in the queue. This is
where the problem occurs. To figure out whether a task is concerned with
the same node, the thread must know which buffer pool instance the task is
associated with, and this can only be determined after it picks up the
bpage it has to deal with (the bpage already has this information by then).
The bpage a thread has to deal with can only be found via buf_page_t* bpage
= static_cast<buf_page_t*>(*m2). Also, the task is not picked up at any
single place, but in parts across different functions, and all these
function calls make some kind of change to the segment/slot/array they are
picking the task from. So, when a thread finally gets a task and finds it
belongs to a different node, then in order to safely return the task to the
queue it must also undo the changes made by all these procedures so far.
This is the difficult part.
Compare this with pc_flush_slot(), where I successfully implemented this
method of skipping tasks not related to the local pool.
In the above commit you will find I have a few lines like 'ib::info() <<
thread_name << " accessing buf_page on node : " << node_of_pool << " in
fil_aio_wait()";'. If you compile and run this code you can see in real
time which threads are accessing which pools; I have given proper names to
them. Also, I tried to skip tasks that were not of the same node by
returning from os_aio_simulated_handler(), which partly worked, but on
trying to return from fil_aio_wait() there seems to be either an infinite
loop or a deadlock somewhere, as the server stops midway (it doesn't abort)
while starting up. Hence I have commented out those lines, on lines 5463
and 5464 of fil0fil.cc, in the above commit.
Another thing is the role of segments here. Are these segments related to
buffer pool instances somehow? All these threads, including the log
threads, start their work by executing fil_aio_wait(segment), but segment
here seems to be just an incrementing integer. At one point, assuming these
segments could contain tasks related to different pools/nodes, I added a
loop such that both the read and write threads loop around
fil_aio_wait(segment), calling it with all valid segments. It didn't work
out.
I attempted these when I pushed that commit to the temp branch weeks ago.
And even though I pointed it out to you, I think I should have given this
task much more importance. Over the past few days, when I gave this task
another try, I couldn't come up with any idea either to replace this queue
with a queue-per-node structure or even to complete this
skip-if-different-node approach.