[Maria-developers] Rev 3712: MDEV-4338 : Support atomic option on directFS/FusionIO
Hi, Vladislav! On Mar 29, Vladislav Vaintroub wrote:
------------------------------------------------------------ revno: 3712 revision-id: wlad@montyprogram.com-20130328203348-1qe9u9nlq6ol9b8l parent: igor@askmonty.org-20130329021836-x6swah73a59im9tn fixes bug: https://mariadb.atlassian.net/browse/MDEV-4338 committer: Vladislav Vaintroub <wlad@montyprogram.com> branch nick: 5.5 timestamp: Thu 2013-03-28 21:33:48 +0100 message: MDEV-4338 : Support atomic option on directFS/FusionIO.
=== modified file 'storage/xtradb/handler/ha_innodb.cc' --- a/storage/xtradb/handler/ha_innodb.cc 2013-03-08 18:08:45 +0000 +++ b/storage/xtradb/handler/ha_innodb.cc 2013-03-28 20:33:48 +0000 @@ -185,6 +185,7 @@ static my_bool innobase_file_format_chec static my_bool innobase_log_archive = FALSE; static char* innobase_log_arch_dir = NULL; #endif /* UNIV_LOG_ARCHIVE */ +static my_bool innobase_use_atomic_writes = TRUE; static my_bool innobase_use_doublewrite = TRUE; static my_bool innobase_use_checksums = TRUE; static my_bool innobase_fast_checksum = FALSE; @@ -3057,6 +3058,35 @@ innobase_init( srv_kill_idle_transaction = 0; #endif
+ srv_use_atomic_writes = (ibool) innobase_use_atomic_writes;
Is that the typical usage pattern in the innodb code? Because it looks a bit silly to me, why couldn't you create a sysvar directly on srv_use_atomic_writes, why an intermediate innobase_use_atomic_writes variable?
+ if (innobase_use_atomic_writes) { + fprintf(stderr, "InnoDB: using atomic writes.\n"); + + /* Force doublewrite buffer off, atomic writes replace it. */ + if (srv_use_doublewrite_buf) { + fprintf(stderr, "InnoDB: Switching off doublewrite buffer " + "because of atomic writes.\n"); + innobase_use_doublewrite = srv_use_doublewrite_buf = FALSE; + } + + /* Force O_DIRECT on Unixes (on Windows writes are always unbuffered)*/ +#ifndef _WIN32 + if(!innobase_file_flush_method || + !strstr(innobase_file_flush_method, "O_DIRECT")) { + innobase_file_flush_method = + srv_file_flush_method_str = (char*)"O_DIRECT"; + fprintf(stderr, "InnoDB: using O_DIRECT due to atomic writes.\n"); + } +#endif +#ifdef HAVE_POSIX_FALLOCATE + /* Due to a bug in directFS, using atomics needs + * posix_fallocate to extend the file + * pwrite() past end of the file won't work + */ + srv_use_posix_fallocate = TRUE; +#endif + }
it'd be nice if you could try os_file_set_atomic_writes() here to see if that works. This function creates and opens quite a few files, you could use one of them to mark it for atomic writes, and if that would fail, you'd disable atomic writes, and wouldn't change innobase_file_flush_method and innobase_use_doublewrite. Btw, why not to use posix_fallocate whenever it's available? Or, at least, with its own --innodb-use-fallocate option? Regards, Sergei
-----Original Message----- From: Sergei Golubchik [mailto:serg@askmonty.org] Sent: Donnerstag, 4. April 2013 17:32 To: maria-developers@lists.launchpad.net Cc: wlad@mariadb.org Subject: Rev 3712: MDEV-4338 : Support atomic option on directFS/FusionIO
Hi Serg,
Hi, Vladislav!
+ srv_use_atomic_writes = (ibool) innobase_use_atomic_writes;
Is that the typical usage pattern in the innodb code? Because it looks a bit silly to me, why couldn't you create a sysvar directly on srv_use_atomic_writes, why an intermediate innobase_use_atomic_writes variable?
Yes, this is typical usage pattern in Innodb. Bit weird, I agree. Every innobase_xxx variable has an srv_xxx counterpart. I have no clue why it is so, but I did not want to disturb their existing conventions:) <skip>
it'd be nice if you could try os_file_set_atomic_writes() here to see if that works. This function creates and opens quite a few files, you could use one of them to mark it for atomic writes, and if that would fail, you'd disable atomic writes, and wouldn't change innobase_file_flush_method and innobase_use_doublewrite.
It is tricky to do it before files are opened the first time , without moving lot of code around - there is non-trivial parsing of tablespace names later on in open_or_create_data_files(), just to figure out directories and filenames. On the other hand, I think it is acceptable to fail fast during start, as long as option is not default, and good diagnostics are written into error log . look into error log, reset parameter to default. Currently there is a single filesystem with atomic capability, so people who use innodb_use_atomic_writes will probably know what they are doing. If/once the option becomes popular so it makes sense to make it default, then I would agree with you, a detection/fallback is very appropriate.
Btw, why not to use posix_fallocate whenever it's available? Or, at least, with its own --innodb-use-fallocate option?
Yes, I guess it (the new option) is a good idea. I created a followup patch that introduces the option http://lists.askmonty.org/pipermail/commits/2013-April/004569.html . I set default to ON. What do you think? Thanks a lot for looking at the patch, Wlad
+ srv_use_atomic_writes = (ibool) innobase_use_atomic_writes;
Is that the typical usage pattern in the innodb code? Because it looks a bit silly to me, why couldn't you create a sysvar directly on srv_use_atomic_writes, why an intermediate innobase_use_atomic_writes variable?
Yes, this is typical usage pattern in Innodb. Bit weird, I agree. Every innobase_xxx variable has an srv_xxx counterpart. I have no clue why it is so, but I did not want to disturb their existing conventions:)
FWIW not necessarily. Look at innodb_io_capacity, innodb_purge_batch_size, innodb_rollback_segments, innodb_purge_threads just name the first few where a srv_* variable is specified in MYSQL_SYSVAR_* directly. Note that the srv_* = innobase_* assignments have typecasts on them, so I haven't checked, but it might be that using the srv_ in a MYSQL_SYSVAR_ directly would have resulted in compilation warnings otherwise. Just 2c, Laurynas
-----Original Message----- From: Laurynas Biveinis [mailto:laurynas.biveinis@gmail.com] Sent: Donnerstag, 4. April 2013 21:59 To: Vladislav Vaintroub Cc: Sergei Golubchik; maria-developers@lists.launchpad.net; wlad@mariadb.org Subject: Re: [Maria-developers] Rev 3712: MDEV-4338 : Support atomic option on directFS/FusionIO
+ srv_use_atomic_writes = (ibool) innobase_use_atomic_writes;
Is that the typical usage pattern in the innodb code? Because it looks a bit silly to me, why couldn't you create a sysvar directly on srv_use_atomic_writes, why an intermediate innobase_use_atomic_writes variable?
Yes, this is typical usage pattern in Innodb. Bit weird, I agree. Every innobase_xxx variable has an srv_xxx counterpart. I have no clue why it is so, but I did not want to disturb their existing conventions:)
FWIW not necessarily. Look at innodb_io_capacity, innodb_purge_batch_size, innodb_rollback_segments, innodb_purge_threads just name the first few where a srv_* variable is specified in MYSQL_SYSVAR_* directly. Note that the srv_* = innobase_* assignments have typecasts on them, so I haven't checked, but it might be that using the srv_ in a MYSQL_SYSVAR_ directly would have resulted in compilation warnings otherwise.
Thanks, Laurinas, this is interesting. I tested it , using srv_use_atomic_writes directly. This gives me compiler error, not warning :) 6>..\..\..\storage\xtradb\handler\ha_innodb.cc(12654): error C2440: 'initializing' : cannot convert from 'ulint *' to 'char *' 6> Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast Indeed, boolean values are handled consequently as "char" in MySQL (i.e my_bool is char ), and as "ulint" in innodb (ibool is ulint)
Just 2c, Laurynas
Hi, Vladislav! On Apr 04, Vladislav Vaintroub wrote:
Hi Serg,
Hi, Vladislav!
+ srv_use_atomic_writes = (ibool) innobase_use_atomic_writes;
Is that the typical usage pattern in the innodb code? Because it looks a bit silly to me, why couldn't you create a sysvar directly on srv_use_atomic_writes, why an intermediate innobase_use_atomic_writes variable?
Yes, this is typical usage pattern in Innodb. Bit weird, I agree. Every innobase_xxx variable has an srv_xxx counterpart. I have no clue why it is so, but I did not want to disturb their existing conventions:)
Ok, good that you and Laurinas have figured that out :) Not that I understand why bool == ulint, but I'm not suggesting to break this convention.
it'd be nice if you could try os_file_set_atomic_writes() here to see if that works. This function creates and opens quite a few files, you could use one of them to mark it for atomic writes, and if that would fail, you'd disable atomic writes, and wouldn't change innobase_file_flush_method and innobase_use_doublewrite.
It is tricky to do it before files are opened the first time , without moving lot of code around - there is non-trivial parsing of tablespace names later on in open_or_create_data_files(), just to figure out directories and filenames.
I mean, you can try your function on any other file, not necessarily on the tablespace. Even on a temporary file, like with int fd = mysql_tmpfile("ib"); if (os_file_set_atomic_writes(fd)) ... my_close(fd);
Btw, why not to use posix_fallocate whenever it's available? Or, at least, with its own --innodb-use-fallocate option?
Yes, I guess it (the new option) is a good idea. I created a followup patch that introduces the option http://lists.askmonty.org/pipermail/commits/2013-April/004569.html . I set default to ON. What do you think?
There are two related threads: https://lists.launchpad.net/maria-developers/msg05068.html and the one in internals@ mysql list, starting from http://lists.mysql.com/internals/38679 In particular, I noticed this part " I relied on that fallocate does not need fsync since metadata is protected by filesystem journal. But I am not confident whether it is true. I'm wondering if this patch may lead InnoDB committing schema to not function normally. " What do you know about it, does one need to sync after posix_fallocate()? What if the filesystem is not journalling? Regards, Sergei
-----Original Message----- From: Sergei Golubchik [mailto:serg@askmonty.org] Sent: Donnerstag, 4. April 2013 23:12 To: Vladislav Vaintroub Cc: maria-developers@lists.launchpad.net Subject: Re: Rev 3712: MDEV-4338 : Support atomic option on directFS/FusionIO
Hi Serg, <skip>
it'd be nice if you could try os_file_set_atomic_writes() here to see if that works. This function creates and opens quite a few files, you could use one of them to mark it for atomic writes, and if that would fail, you'd disable atomic writes, and wouldn't change innobase_file_flush_method and innobase_use_doublewrite.
It is tricky to do it before files are opened the first time , without moving lot of code around - there is non-trivial parsing of tablespace names later on in open_or_create_data_files(), just to figure out directories and filenames.
I mean, you can try your function on any other file, not necessarily on the tablespace. Even on a temporary file, like with
int fd = mysql_tmpfile("ib"); if (os_file_set_atomic_writes(fd)) ... my_close(fd);
Btw, why not to use posix_fallocate whenever it's available? Or, at least, with its own --innodb-use-fallocate option?
Yes, I guess it (the new option) is a good idea. I created a followup
Right, but the subtlety here is that test file needs to be on the same device as ibdata1 tablespace, and figuring out the correct directory is not trivial . Apart from easy default situation, where ibdata lands in the into datadir, there is innodb_data_home_dir , as well as innodb_data_file_path (this one needs to be parsed). Not to forget possible symbolic links . Imagine ibdata1 is placed on atomic-capable filesystem, and symlink to it into datadir. So I'd still say this is tricky.. patch
that introduces the option http://lists.askmonty.org/pipermail/commits/2013-April/004569.html . I set default to ON. What do you think?
There are two related threads:
https://lists.launchpad.net/maria-developers/msg05068.html and the one in internals@ mysql list, starting from http://lists.mysql.com/internals/38679
In particular, I noticed this part " I relied on that fallocate does not need fsync since metadata is protected by filesystem journal. But I am not confident whether it is true. I'm wondering if this patch may lead InnoDB committing schema to not function normally.
What do you know about it, does one need to sync after posix_fallocate()? What if the filesystem is not journalling?
This is a good question, and I do not think I have enough knowledge to answer that one :) note, that fsync() is done anyway almost everywhere after os_set_file_size() 1. os_set_file_size during creation single tablespace (file-per-table I believe) is followed almost immediately by os_file_flush() 2. os_set_file_size when tablespace is extended, is followed by fil_flush. When innodb starts up for the first time (bootstrap), it is a little bit different - log file or tablespace are created , os_set_file_size() is called, and then the file is closed, but not flushed. My feeling is that it is ok, hoping at least on close() the metadata will be flushed. And even if not, in case of the probably worst scenario that machine crashes during bootstrap, - in this case user data is not lost , as there is no user data yet. Having said all this, I would not mind to adding an extra fsync() to the function, if this makes feel safer for someone. I think the overhead of it would be minimal.
Regards, Sergei
Hi, Vladislav! On Apr 05, Vladislav Vaintroub wrote:
it'd be nice if you could try os_file_set_atomic_writes() here to see if that works.
I mean, you can try your function on any other file, not necessarily on the tablespace. Even on a temporary file
Right, but the subtlety here is that test file needs to be on the same device as ibdata1 tablespace, and figuring out the correct directory is not trivial .
Apart from easy default situation, where ibdata lands in the into datadir, there is innodb_data_home_dir , as well as innodb_data_file_path (this one needs to be parsed). Not to forget possible symbolic links . Imagine ibdata1 is placed on atomic-capable filesystem, and symlink to it into datadir.
So I'd still say this is tricky..
Okay, let's do it your way. If it will cause problems for users we can always reconsider.
What do you know about it, does one need to sync after posix_fallocate()?
This is a good question, and I do not think I have enough knowledge to answer that one :)
note, that fsync() is done anyway almost everywhere after os_set_file_size() 1. os_set_file_size during creation single tablespace (file-per-table I believe) is followed almost immediately by os_file_flush() 2. os_set_file_size when tablespace is extended, is followed by fil_flush.
When innodb starts up for the first time (bootstrap), it is a little bit different - log file or tablespace are created , os_set_file_size() is called, and then the file is closed, but not flushed. My feeling is that it is ok, hoping at least on close() the metadata will be flushed. And even if not, in case of the probably worst scenario that machine crashes during bootstrap, - in this case user data is not lost , as there is no user data yet.
Right, and as soon as there are any data there - on any transaction - there will be a flush.
Having said all this, I would not mind to adding an extra fsync() to the function, if this makes feel safer for someone. I think the overhead of it would be minimal.
No, I think it's fine the way it is. As you didn't remove any fsyncs and InnoDB does sync after resizing, it should be ok. Regards, Sergei
participants (3)
-
Laurynas Biveinis
-
Sergei Golubchik
-
Vladislav Vaintroub