Hi Sergei, thanks for looking into this.

On Wed, Aug 28, 2019 at 7:54 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Robert!

On Aug 28, Robert Bindar wrote:
> revision-id: 5ba0b11c5d1 (mariadb-10.4.7-35-g5ba0b11c5d1)
> parent(s): 4a490d1a993
> author: Robert Bindar <robert@mariadb.org>
> committer: Robert Bindar <robert@mariadb.org>
> timestamp: 2019-08-27 11:38:52 +0300
> message:
>
> MDEV-20257 Fix crash in Grant_table_base::init_read_record
>
> The bug shows up when a 10.4+ server starts on a <10.4 datadir
> with a crashed mysql.user table.
> Grant_tables::open_and_lock tries to open the list of requested
> tables (host, db, global_priv,..) and because global_priv doesn't
> exist (pre-10.4 datadir), it falls back to opening mysql.user.
> The first open_tables call for [host, db, global_priv,..] works
> just fine. The second open_tables call (trying to open mysql.user
> alone) sees the crashed user table and performs 3 steps:
> - closes all the open tables
> - attempts a table fix for mysql.user (succeeds)
> - opens the tables initially passed as arguments
> In an ideal world, the first step should only close the tables
> passed as arguments (i.e. mysql.user), but for some reasons
> close_tables_for_reopen makes a close_thread_tables call which
> closes everything in THD::open_tables (nevertheless, this is
> probably the intended behavior).
> This side effect causes all the tables opened in the first
> open_tables call to now be closed and Grant_table_base::init_read_record
> tries to read from a released table.
>
> diff --git a/mysql-test/main/mdev20257.test b/mysql-test/main/mdev20257.test
> new file mode 100644
> index 00000000000..8506292d57c
> --- /dev/null
> +++ b/mysql-test/main/mdev20257.test
> @@ -0,0 +1,51 @@
> +--source include/not_embedded.inc
> +--echo #
> +--echo # MDEV 20257 Server crashes in Grant_table_base::init_read_record
> +--echo # upon crash-upgrade
> +--echo #
> +
> +let $MYSQLD_DATADIR= `select @@datadir`;
> +
> +rename table mysql.global_priv to mysql.global_priv_bak;
> +rename table mysql.user to mysql.user_bak;
> +rename table mysql.db to mysql.db_bak;
> +rename table mysql.proxies_priv to mysql.proxies_priv_bak;
> +rename table mysql.roles_mapping to mysql.roles_mapping_bak;
> +
> +--source include/shutdown_mysqld.inc
> +
> +# Bring in a crashed user table
> +# Ideally we should've copied only the crashed mysql.user, but this
> +# would make mysqld crash in some other place before hitting the
> +# crash spot described in MDEV-20257 which is what we're trying to
> +# test against.

I don't understand that. Where does it crash then?
What difference does it make that you replace all tables - are they all
corrupted?
Only mysql.user is corrupted. Assuming we take out the fix in acl, if we don't replace
the db table with the 10.2 version provided by Elena, the server crashes a bit earlier
in lock_tables (called from Grant_tables::open_and_lock) for reasons I don't yet understand,
my guess is it has something to do with the aria/myisam change between 10.2 and 10.4.
The host table needs to be copied because the reported crash happened when acl_load tried
to read the host table and in 10.4 it is not there anymore.
And for proxies_priv and roles_mapping, it is enough if we just hide them (otherwise the same
Aria crash happens).
> +--copy_file std_data/mdev20257/user.frm $MYSQLD_DATADIR/mysql/user.frm
> +--copy_file std_data/mdev20257/user.MYD $MYSQLD_DATADIR/mysql/user.MYD
> +--copy_file std_data/mdev20257/user.MYI $MYSQLD_DATADIR/mysql/user.MYI
> +
> +--copy_file std_data/mdev20257/host.frm $MYSQLD_DATADIR/mysql/host.frm
> +--copy_file std_data/mdev20257/host.MYD $MYSQLD_DATADIR/mysql/host.MYD
> +--copy_file std_data/mdev20257/host.MYI $MYSQLD_DATADIR/mysql/host.MYI
> +
> +--copy_file std_data/mdev20257/db.frm $MYSQLD_DATADIR/mysql/db.frm
> +--copy_file std_data/mdev20257/db.MYD $MYSQLD_DATADIR/mysql/db.MYD
> +--copy_file std_data/mdev20257/db.MYI $MYSQLD_DATADIR/mysql/db.MYI
> +--copy_file std_data/mdev20257/db.opt $MYSQLD_DATADIR/mysql/db.opt
> +
> +--source include/start_mysqld.inc
> +
> +call mtr.add_suppression("Table \'\..mysql\.user\' is marked as crashed and should be repaired");
> +call mtr.add_suppression("Checking table:   \'\..mysql\.user\'");
> +call mtr.add_suppression("mysql.user: 1 client is using or hasn't closed the table properly");
> +call mtr.add_suppression("Missing system table mysql..*; please run mysql_upgrade to create it");
> +
> +# Cleanup
> +drop table mysql.user;
> +drop table mysql.db;
> +drop table mysql.host;
> +rename table mysql.global_priv_bak to mysql.global_priv;
> +rename table mysql.user_bak to mysql.user;
> +rename table mysql.db_bak to mysql.db;
> +rename table mysql.proxies_priv_bak to mysql.proxies_priv;
> +rename table mysql.roles_mapping_bak to mysql.roles_mapping;
> +
> diff --git a/mysql-test/std_data/mdev20257/db.opt b/mysql-test/std_data/mdev20257/db.opt
> --- /dev/null
> +++ b/mysql-test/std_data/mdev20257/db.opt
> @@ -0,0 +1,2 @@
> +default-character-set=latin1
> +default-collation=latin1_swedish_ci

why is that?
I agree we don't need db.opt at all.

> diff --git a/sql/sql_acl.cc b/sql/sql_acl.cc
> index 847d2bd777b..e028d084703 100644
> --- a/sql/sql_acl.cc
> +++ b/sql/sql_acl.cc
> -      DBUG_RETURN(res);
> +    first = build_open_list(tables, which_tables, lock_type, true);
> +
> +    uint counter;
> +    if (int rv= really_open(thd, first, &counter))
> +      DBUG_RETURN(rv);
> +
> +    TABLE *user_table= tables[USER_TABLE].table;
> +    if ((which_tables & Table_user) && !user_table)
> +    {
> +      close_thread_tables(thd);
> +      first = build_open_list(tables, which_tables, lock_type, false);

Hmm... Here you always close/reopen all tables, just to account for the
case when the user table is found corrupted.

I'd argue that a corrupted user table happens rarely (compared to the
non-corrupted), so it makes sense to optimize for the normal use case.

That is, only open the user table as before. After it's opened you
check if other tables are still opened - if they aren't you can close
and reopen everything. This way you only close/reopen all tables if the
user table was corrupted.

Makes sense?

It makes sense. Is there any other way to check if a table is opened except for going
through thd->open_tables and see if the table is there? Touching TABLE_LIST::table that
we have in open_and_lock might not be a good idea if the table instance is already released
to table cache.
> +      if (int rv= really_open(thd, first, &counter))
> +        DBUG_RETURN(rv);
> +
> +      p_user_table= &m_user_table_tabular;
> +      user_table= tables[USER_TABLE + 1].table;
> +    }

Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org