Hi Sergei, On Fri, Dec 14, 2012 at 12:26:24PM +0100, Sergei Golubchik wrote:
The handler is quite ok. I've only had a few, mostly cosmetic, comments.
Addressed most of them, but some remain. Can we discuss them on IRC?
=== modified file 'cmake/build_configurations/mysql_release.cmake' --- cmake/build_configurations/mysql_release.cmake 2012-06-06 12:15:29 +0000 +++ cmake/build_configurations/mysql_release.cmake 2012-11-12 15:11:23 +0000 @@ -46,6 +46,8 @@ SET(FEATURE_SET_large 5) SET(FEATURE_SET_xlarge 6) SET(FEATURE_SET_community 7)
+SET(WITH_CASSANDRA_STORAGE_ENGINE ON) +
This probably should be reverted before pushing into main, right?
Yes. I suppose this will be changed to some other setting that builds Cassandra only in certain configurations.
IF(FEATURE_SET) STRING(TOLOWER ${FEATURE_SET} feature_set) SET(num ${FEATURE_SET_${feature_set}})
=== modified file 'sql/sql_join_cache.cc' --- sql/sql_join_cache.cc 2012-03-24 17:21:22 +0000 +++ sql/sql_join_cache.cc 2012-11-12 15:11:23 +0000 @@ -4543,7 +4546,7 @@ bool JOIN_CACHE_BKAH::prepare_look_for_m { last_matching_rec_ref_ptr= next_matching_rec_ref_ptr= 0; if (no_association && - (curr_matching_chain= get_matching_chain_by_join_key())) + !(curr_matching_chain= get_matching_chain_by_join_key())) //psergey: added '!'
Is that something that should be pushed into the main branch, or is it a temporary hack for Cassandra that should be reverted?
It's not a temporary hack; it is a fix for a typo bug in the BKA code. The bug can only be observed when BKA is used with a "no-association" storage engine, like Cassandra SE or Falcon. The 'psergey:' comment should probably be removed.
return 1; last_matching_rec_ref_ptr= get_next_rec_ref(curr_matching_chain); return 0;
=== added file 'storage/cassandra/CMakeLists.txt' --- storage/cassandra/CMakeLists.txt 1970-01-01 00:00:00 +0000 +++ storage/cassandra/CMakeLists.txt 2012-11-12 15:11:23 +0000 @@ -0,0 +1,25 @@ + +SET(cassandra_sources + ha_cassandra.cc + ha_cassandra.h + cassandra_se.h + cassandra_se.cc + gen-cpp/Cassandra.cpp + gen-cpp/cassandra_types.h + gen-cpp/cassandra_types.cpp + gen-cpp/cassandra_constants.h + gen-cpp/cassandra_constants.cpp + gen-cpp/Cassandra.h) + +#INCLUDE_DIRECTORIES(BEFORE ${Boost_INCLUDE_DIRS}) + +#INCLUDE_DIRECTORIES(AFTER /usr/local/include/thrift) +INCLUDE_DIRECTORIES(AFTER /home/buildbot/build/thrift-inst/include/thrift/)
This needs to be fixed before pushing into the main tree, I suppose; you'll need to detect here whether Thrift is available, and disable the engine otherwise.
Agreed. We need to figure out how to make CMake check for Thrift.
+ +# +STRING(REPLACE "-fno-exceptions" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) +STRING(REPLACE "-fno-implicit-templates" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) +#LINK_DIRECTORIES(/home/psergey/cassandra/thrift/lib) + +MYSQL_ADD_PLUGIN(cassandra ${cassandra_sources} STORAGE_ENGINE LINK_LIBRARIES thrift) +# was: STORAGE_ENGINE MANDATORY
...
=== added file 'storage/cassandra/ha_cassandra.h' --- storage/cassandra/ha_cassandra.h 1970-01-01 00:00:00 +0000 +++ storage/cassandra/ha_cassandra.h 2012-11-12 15:11:23 +0000 ... + ulonglong table_flags() const + { + /* + HA_BINLOG_STMT_CAPABLE + We are saying that this engine is just statement capable to have + an engine that can only handle statement-based logging. This is + used in testing. + HA_REC_NOT_IN_SEQ + If we don't set it, filesort crashes, because it assumes rowids are + 1..8 byte numbers + */ + return HA_BINLOG_STMT_CAPABLE | + HA_REC_NOT_IN_SEQ;
HA_NO_TRANSACTIONS is unset. Is this engine transactional? No. I've added the flag.
HA_PARTIAL_COLUMN_READ is unset. Do you always work with all columns, ignoring read_set and write_set? Currently, yes.
HA_TABLE_SCAN_ON_INDEX is unset. Do you store the data MyISAM-style, index and data in separate files? HA_FAST_KEY_READ is unset. Is reading keys in order faster in Cassandra than random reads? TODO: Not sure about the above two flags; let's discuss them.
HA_REQUIRE_PRIMARY_KEY is unset, but a primary key is required. Added the flag.
HA_PRIMARY_KEY_IN_READ_INDEX is unset, but as far as I see, primary key columns are always returned (and needed for position()). Added the flag.
HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is unset, but a primary key is required for position(). Added the flag.
HA_NO_AUTO_INCREMENT is unset. Is auto-increment supported? Added the flag.
+ + } ... + uint max_supported_key_length() const { return 16*1024; /* just to return something*/ } + + int index_init(uint idx, bool sorted); + + int index_read_map(uchar * buf, const uchar * key, + key_part_map keypart_map, + enum ha_rkey_function find_flag); + + /** @brief + Called in test_quick_select to determine if indexes should be used. + */ + virtual double scan_time() { return (double) (stats.records+stats.deleted) / 20.0+10; }
Does that make any sense? (I understand that it's copy-paste.) stats.deleted, for example, is never set.
I've added "stats.deleted= 0" to ha_cassandra::info(). I'd prefer to keep this default implementation for the function.
+ + /** @brief + This method will never be called if you do not implement indexes. + */ + virtual double read_time(uint, uint, ha_rows rows) + { return (double) rows / 20.0+1; }
Same question as for scan_time().
...
=== added file 'storage/cassandra/ha_cassandra.cc' --- storage/cassandra/ha_cassandra.cc 1970-01-01 00:00:00 +0000 +++ storage/cassandra/ha_cassandra.cc 2012-11-12 15:11:23 +0000 ... + +static int cassandra_init_func(void *p) +{ + DBUG_ENTER("cassandra_init_func"); + +#ifdef HAVE_PSI_INTERFACE + init_cassandra_psi_keys(); +#endif + + cassandra_hton= (handlerton *)p; + mysql_mutex_init(ex_key_mutex_example, &cassandra_mutex, MY_MUTEX_INIT_FAST); + (void) my_hash_init(&cassandra_open_tables,system_charset_info,32,0,0, + (my_hash_get_key) cassandra_get_key,0,0); + + cassandra_hton->state= SHOW_OPTION_YES; + cassandra_hton->create= cassandra_create_handler; + /* + Don't specify HTON_CAN_RECREATE in flags. re-create is used by TRUNCATE + TABLE to create an *empty* table from scratch. Cassandra table won't be + emptied if re-created. + */
HTON_ALTER_NOT_SUPPORTED is not set. ALTER works?
Yes. A Cassandra table is a view into the column family. We support instant ALTER TABLE, as long as the new DDL makes sense.
HTON_TEMPORARY_NOT_SUPPORTED is not set. CREATE TEMPORARY TABLE works? HTON_NO_PARTITION is not set. Partitioning works?
+ cassandra_hton->flags= 0; + cassandra_hton->table_options= cassandra_table_option_list; + cassandra_hton->field_options= cassandra_field_option_list; + + mysql_mutex_init(0 /* no instrumentation */,
Why no instrumentation?
+ &cassandra_default_host_lock, MY_MUTEX_INIT_FAST); + + DBUG_RETURN(0); +} + + ...
+ +ColumnDataConverter *map_field_to_validator(Field *field, const char *validator_name) +{ + ColumnDataConverter *res= NULL; + + switch(field->type()) { + case MYSQL_TYPE_TINY: + if (!strcmp(validator_name, validator_boolean))
Why do you strcmp here, while you only check selected characters in get_cassandra_type()? You could've called get_cassandra_type() for validator_name and compared enum values instead of strings.
This function (written by me) was here before get_cassandra_type() was written by Sanja. I don't particularly like the idea of checking selected characters: what if the next version of Cassandra gets a new datatype? The code in this function will refuse to work with it and produce a meaningful error message; the code in Sanja's function will silently map it to some datatype, and the conversion will not make sense.

Also, map_field_to_validator() is not exactly a hotspot function. Establishing a connection to Cassandra currently requires:
- making a TCP connection
- requesting Thrift-packed DDL for the column family from Cassandra
- receiving the mentioned DDL via Thrift (which populates the data structures)
- checking the needed columns

Apparently, map_field_to_validator() is not the bottleneck. If connection setup is an issue, we need a connection pool. Why optimize map_field_to_validator() to the point of unreadability?
+ { + res= new TinyintDataConverter; + break; + } + /* fall through: */ + case MYSQL_TYPE_SHORT: + case MYSQL_TYPE_LONGLONG: + { + bool is_counter= false; + if (!strcmp(validator_name, validator_bigint) || + !strcmp(validator_name, validator_timestamp) || + (is_counter= !strcmp(validator_name, validator_counter))) + res= new BigintDataConverter(!is_counter); + break; + } + case MYSQL_TYPE_FLOAT: + if (!strcmp(validator_name, validator_float)) + res= new FloatDataConverter; + break; + + case MYSQL_TYPE_DOUBLE: + if (!strcmp(validator_name, validator_double)) + res= new DoubleDataConverter; + break; + + case MYSQL_TYPE_TIMESTAMP: + if (!strcmp(validator_name, validator_timestamp)) + res= new TimestampDataConverter; + break; + + case MYSQL_TYPE_STRING: // these are space padded CHAR(n) strings. + if (!strcmp(validator_name, validator_uuid) && + field->real_type() == MYSQL_TYPE_STRING && + field->field_length == 36) + { + // UUID maps to CHAR(36), its text representation + res= new UuidDataConverter; + break; + } + /* fall through: */ + case MYSQL_TYPE_VAR_STRING: + case MYSQL_TYPE_VARCHAR: + { + /* + Cassandra's "varint" type is a binary-encoded arbitrary-length + big-endian number. + - It can be mapped to VARBINARY(N), with sufficiently big N. + - If the value does not fit into N bytes, it is an error. We should not + truncate it, because that is just as good as returning garbage. + - varint should not be mapped to BINARY(N), because BINARY(N) values + are zero-padded, which will work as multiplying the value by + 2^k for some value of k.
+ */ + if (field->type() == MYSQL_TYPE_VARCHAR && + field->binary() && + (!strcmp(validator_name, validator_varint) || + !strcmp(validator_name, validator_decimal))) + { + res= new StringCopyConverter(field->field_length); + break; + } + + if (!strcmp(validator_name, validator_blob) || + !strcmp(validator_name, validator_ascii) || + !strcmp(validator_name, validator_text)) + { + res= new StringCopyConverter((size_t)-1); + } + break; + } + case MYSQL_TYPE_LONG: + if (!strcmp(validator_name, validator_int)) + res= new Int32DataConverter; + break; + + default:; + } + return res; +} + + ...
+ +int ha_cassandra::index_init(uint idx, bool sorted) +{ + int ires; + if (!se && (ires= connect_and_check_options(table))) + return ires; + return 0; +} + +void store_key_image_to_rec(Field *field, uchar *ptr, uint len); + +int ha_cassandra::index_read_map(uchar *buf, const uchar *key, + key_part_map keypart_map, + enum ha_rkey_function find_flag) +{ + int rc= 0; + DBUG_ENTER("ha_cassandra::index_read_map"); + + if (find_flag != HA_READ_KEY_EXACT) + DBUG_RETURN(HA_ERR_WRONG_COMMAND);
Did you verify that the server expects HA_ERR_WRONG_COMMAND from this method and correctly handles it? Or, perhaps, find_flag is always HA_READ_KEY_EXACT here, because of your index_flags()?
The latter is true. I don't expect that this function is ever invoked with find_flag != HA_READ_KEY_EXACT.
+ + uint key_len= calculate_key_len(table, active_index, key, keypart_map); + store_key_image_to_rec(table->field[0], (uchar*)key, key_len); + + char *cass_key; + int cass_key_len; + my_bitmap_map *old_map; + + old_map= dbug_tmp_use_all_columns(table, table->read_set); + + if (rowkey_converter->mariadb_to_cassandra(&cass_key, &cass_key_len)) + { + /* We get here when making lookups like uuid_column='not-an-uuid' */ + dbug_tmp_restore_column_map(table->read_set, old_map); + DBUG_RETURN(HA_ERR_KEY_NOT_FOUND); + } + + dbug_tmp_restore_column_map(table->read_set, old_map); + + bool found; + if (se->get_slice(cass_key, cass_key_len, &found)) + { + my_error(ER_INTERNAL_ERROR, MYF(0), se->error_str());
Generally, it's better not to use my_error() directly, but to return only an error code, and to return your se->error_str() from ha_cassandra::get_error_message().
+ rc= HA_ERR_INTERNAL_ERROR; + } + + /* TODO: what if we're not reading all columns?? */ + if (!found) + rc= HA_ERR_KEY_NOT_FOUND; + else + rc= read_cassandra_columns(false); + + DBUG_RETURN(rc); +} + ...
+int ha_cassandra::rnd_pos(uchar *buf, uchar *pos) +{ + int rc; + DBUG_ENTER("ha_cassandra::rnd_pos"); + + int save_active_index= active_index; + active_index= 0; /* The primary key */ + rc= index_read_map(buf, pos, key_part_map(1), HA_READ_KEY_EXACT); + + active_index= save_active_index;
A bit nicer would be to implement index_read_idx_map() and call it from here and from index_read_map().
+ + DBUG_RETURN(rc); +} + ... +bool ha_cassandra::mrr_start_read() +{ + uint key_len; + + my_bitmap_map *old_map; + old_map= dbug_tmp_use_all_columns(table, table->read_set); + + se->new_lookup_keys(); + + while (!(source_exhausted= mrr_funcs.next(mrr_iter, &mrr_cur_range))) + { + char *cass_key; + int cass_key_len; + + DBUG_ASSERT(mrr_cur_range.range_flag & EQ_RANGE); + + uchar *key= (uchar*)mrr_cur_range.start_key.key; + key_len= mrr_cur_range.start_key.length; + //key_len= calculate_key_len(table, active_index, key, keypart_map); // NEED THIS?? + store_key_image_to_rec(table->field[0], (uchar*)key, key_len); + + rowkey_converter->mariadb_to_cassandra(&cass_key, &cass_key_len); + + // Primitive buffer control + if (se->add_lookup_key(cass_key, cass_key_len) > + THDVAR(table->in_use, multiget_batch_size)) + break;
I'd suggest adding a status variable to count how many times the buffer was refilled.
Agreed. I'd prefer to do this in the next version.
+ } + + dbug_tmp_restore_column_map(table->read_set, old_map); + + return se->multiget_slice(); +} ... + + +/* The following function was copied from ha_blackhole::store_lock: */ +THR_LOCK_DATA **ha_cassandra::store_lock(THD *thd, + THR_LOCK_DATA **to, + enum thr_lock_type lock_type) +{ + DBUG_ENTER("ha_cassandra::store_lock"); + if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK) + { + /* + Here is where we get into the guts of a row level lock. + If TL_UNLOCK is set + If we are not doing a LOCK TABLE or DISCARD/IMPORT + TABLESPACE, then allow multiple writers + */ + + if ((lock_type >= TL_WRITE_CONCURRENT_INSERT && + lock_type <= TL_WRITE) && !thd_in_lock_tables(thd) + && !thd_tablespace_op(thd))
1. A tablespace op in Cassandra? Really? Too much copy-paste is confusing (here and in the comments too). 2. Did you test LOCK TABLES?
What is a meaningful usage scenario of Cassandra SE with LOCK TABLES? It is a distributed engine; do locks matter at all?
+ lock_type = TL_WRITE_ALLOW_WRITE;
That makes all changes to Cassandra immediately visible to concurrently running threads. Okay if that's intentional, but it needs to be documented, as it is not the SQL semantics.
It is intentional; Cassandra's storage semantics are different from SQL's.
+ + /* + In queries of type INSERT INTO t1 SELECT ... FROM t2 ... + MySQL would use the lock TL_READ_NO_INSERT on t2, and that + would conflict with TL_WRITE_ALLOW_WRITE, blocking all inserts + to t2. Convert the lock to a normal read lock to allow + concurrent inserts to t2. + */ + + if (lock_type == TL_READ_NO_INSERT && !thd_in_lock_tables(thd)) + lock_type = TL_READ;
This also breaks SBR for INSERT ... SELECT. Same as above: okay if intentional, but it should be thoroughly documented.
I am not certain about this. Let's discuss it.
+ + lock.type= lock_type; + } + *to++= &lock; + DBUG_RETURN(to); +} + + +int ha_cassandra::external_lock(THD *thd, int lock_type) +{ + DBUG_ENTER("ha_cassandra::external_lock"); + DBUG_RETURN(0); +} + +int ha_cassandra::delete_table(const char *name) +{ + DBUG_ENTER("ha_cassandra::delete_table"); + /* + Cassandra table is just a view. Dropping it doesn't affect the underlying + column family. + */ + DBUG_RETURN(0); +}
more dummies...
handler::delete_table needs this implementation. I've removed all the other dummies, as well as the "dummy implementations start/end" comments. ...
+static int show_cassandra_vars(THD *thd, SHOW_VAR *var, char *buff) +{ + cassandra_counters_copy= cassandra_counters; + + var->type= SHOW_ARRAY; + var->value= (char *) &cassandra_status_variables;
What's the point? If you don't do anything in this function, you can just as easily list all status variables below in the func_status[] array.
+ return 0; +} + + +struct st_mysql_storage_engine cassandra_storage_engine= +{ MYSQL_HANDLERTON_INTERFACE_VERSION }; + +static struct st_mysql_show_var func_status[]= +{ + {"Cassandra", (char *)show_cassandra_vars, SHOW_FUNC}, + {0,0,SHOW_UNDEF} +}; + +maria_declare_plugin(cassandra) +{ + MYSQL_STORAGE_ENGINE_PLUGIN, + &cassandra_storage_engine, + "CASSANDRA", + "Monty Program Ab", + "Cassandra storage engine", + PLUGIN_LICENSE_GPL, + cassandra_init_func, /* Plugin Init */ + cassandra_done_func, /* Plugin Deinit */ + 0x0001, /* version number (0.1) */ + func_status, /* status variables */ + cassandra_system_variables, /* system variables */ + "0.1", /* string version */ + MariaDB_PLUGIN_MATURITY_EXPERIMENTAL /* maturity */ +} +maria_declare_plugin_end;
BR Sergei -- Sergei Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog