Hi Monty, thanks for review. Btw, I did not tell you what the german2 patch actually includes: 1. WL#4013 Unicode german2 collation itself. 2. WL#5624 Collation customization improvements (prerequisite for german2) http://dev.mysql.com/worklog/task/?id=5624 3. Joro's patch for "Bug#62429 XML: ExtractValue, UpdateXML max arg length 127 chars." (prerequisite for WL#5624) 4. Partially, mtr tests for WL#5624. The other part of the WL#5624 tests depends on the WEIGHT_STRING() function which we have not merged yet. Comparing to the first version of the patch, the second version additionally includes: - Monty's review suggestions addressed. - Added *partially* tests for WL#5624 (those not needing WEIGHT_STRING) - Added "WL#5624 Part 13. More verbosity when reading character set definition file" which I somehow missed in the first version. Please also see comments inline: On 09/12/2013 04:32 PM, Michael Widenius wrote:
Review of german2.diff
=== modified file 'include/my_global.h' --- include/my_global.h 2013-07-21 14:39:19 +0000 +++ include/my_global.h 2013-08-29 08:27:09 +0000 @@ -1218,4 +1218,11 @@ static inline double rint(double x) #define HAVE_EXTERNAL_CLIENT #endif /* EMBEDDED_LIBRARY */
+ +enum loglevel { + ERROR_LEVEL= 0, + WARNING_LEVEL= 1, + INFORMATION_LEVEL= 2 +}; + #endif /* my_global_h */
Not sure why the above, which is a my_sys construct, should be in my_global.h. Please add at least a comment about this.
my_sys.h includes m_ctype.h. The definition of loglevel was in my_sys.h, but now it's needed in m_ctype.h as well. So I just moved the definition of loglevel to my_global.h, which is seen in both m_ctype.h and my_sys.h. That solved the problem, but as you say, introduced extra visibility of "loglevel", which is not good. In the second version of the patch (attached) I just moved the definition of loglevel from my_sys.h to m_ctype.h instead. It solves all definition dependencies and does not introduce extra visibility.
=== modified file 'mysys/charset.c' --- mysys/charset.c 2012-08-14 14:23:34 +0000 +++ mysys/charset.c 2013-08-29 12:38:44 +0000 @@ -66,6 +66,9 @@ static my_bool init_state_maps(struct ch if (!(cs->ident_map= ident_map= (uchar*) my_once_alloc(256, MYF(MY_WME)))) return 1;
+ state_map= (uchar *) cs->state_map; + ident_map= (uchar *) cs->ident_map;
Why is the above needed? They are already set one line above. (In addition cs->state_map and cs->ident_map are already uchar *)
Thanks for noticing this. Removed.
--- strings/ctype-simple.c 2013-07-21 14:39:19 +0000 +++ strings/ctype-simple.c 2013-08-29 09:05:07 +0000 @@ -1163,12 +1163,12 @@ static int pcmp(const void * f, const vo return res; }
-static my_bool create_fromuni(struct charset_info_st *cs, - void *(*alloc)(size_t)) +static my_bool +create_fromuni(struct charset_info_st *cs, + MY_CHARSET_LOADER *loader) { uni_idx idx[PLANE_NUM]; int i,n; - struct my_uni_idx_st *tab_from_uni;
/* Check that Unicode map is loaded. @@ -1209,18 +1209,18 @@ static my_bool create_fromuni(struct cha for (i=0; i < PLANE_NUM; i++) { int ch,numchars; - uchar *tab;
/* Skip empty plane */ if (!idx[i].nchars) break;
numchars=idx[i].uidx.to-idx[i].uidx.from+1; - if (!(idx[i].uidx.tab= tab= (uchar*) - alloc(numchars * sizeof(*idx[i].uidx.tab)))) + if (!(idx[i].uidx.tab= (uchar *) + (loader->once_alloc) (numchars * + sizeof(*idx[i].uidx.tab)))) return TRUE;
- bzero(tab,numchars*sizeof(*tab)); + bzero((char*) idx[i].uidx.tab, numchars * sizeof(*idx[i].uidx.tab));
for (ch=1; ch < PLANE_SIZE; ch++) { @@ -1228,32 +1228,32 @@ static my_bool create_fromuni(struct cha if (wc >= idx[i].uidx.from && wc <= idx[i].uidx.to && wc) { int ofs= wc - idx[i].uidx.from; - tab[ofs]= ch; + ((char *) idx[i].uidx.tab)[ofs]= ch; } } }
Why remove tab to be a shortcut for idx[i].uidx.from ? Shouldn't it be faster and shorter to use tab than using idx[i].uidx.from? (At least in this place)
I guess you mean idx[i].uidx.tab. Okey, I restored the variable. But generally I don't like pointer aliases (as it's bug prone). Moreover, having an alias in this code does not improve performance, because this code is called only once per collation during mysqld life time. The best choice would probably be to move the loop inside a function and pass idx[i].uidx.tab into the function.
-static const uint16 page0FCdata[]= { /* FC00 (3 weights per char) */ +static uint16 page0FCdata[]= { /* FC00 (3 weights per char) */ 0x134F,0x135E,0x0000, 0x134F,0x1364,0x0000, 0x134F,0x13B0,0x0000, 0x134F,0x13C7,0x0000, 0x134F,0x13C8,0x0000, 0x1352,0x135E,0x0000, 0x1352,0x1364,0x0000, 0x1352,0x1365,0x0000, 0x1352,0x13B0,0x0000, @@ -6006,7 +6005,7 @@ static const uint16 page0FCdata[]= { /* 0x1381,0x13C8,0x0000, 0x1382,0x13C7,0x0000, 0x1382,0x13C8,0x0000, 0x1364,0x13C7,0x0000 };
If we are not going to change the above, better to keep these as const!
They will then be in the const segment and you will have protection from anyone trying to change these!
Changed back to const. I had to do two casts though, in this variable definition, to resolve const/non-const conflict: MY_UCA_INFO my_uca_v400= { { { 0xFFFF, /* maxchar */ (uchar *) uca_length, (uint16 **) uca_weight, { /* Contractions: */ 0, /* nitems */ NULL, /* item */ NULL /* flags */ } }, }, which should be fine.
<cut>
+my_uca_add_contraction(MY_CONTRACTIONS *list, my_wc_t *wc, size_t len, + my_bool with_context) { - MY_CONTRACTIONS *list= (MY_CONTRACTIONS*) cs->contractions; MY_CONTRACTION *next= &list->item[list->nitems]; - DBUG_ASSERT(len == 2); /* We currently support only contraction2 */ - next->ch[0]= wc[0]; - next->ch[1]= wc[1]; + size_t i; + /* + Contraction is always at least 2 characters. + Contraction is never longer than MY_UCA_MAX_CONTRACTION, + which is guaranteed by using my_coll_rule_expand() with proper limit. + */ + DBUG_ASSERT(len > 1 && len <= MY_UCA_MAX_CONTRACTION); + for (i= 0; i < len; i++) + { + /* + We don't support contractions with U+0000. + my_coll_rule_expand() guarantees there're no U+0000 in a contraction. + */ + DBUG_ASSERT(wc[i] != 0); + next->ch[i]= wc[i]; + } + if (i < MY_UCA_MAX_CONTRACTION) + next->ch[i]= 0; /* Add end-of-line marker */
Wouldn't it make life easier to always have the marker there? (Ie, always have place for the marker). At least you can remove one if from the code. If there is more than one if to remove, then it's a simple optimzation do to...
The Unicode Collation Algorithm tables are quite huge. See ctype-uca.c. Adding a space for 0-terminator for every character weight would make the tables even huger. Not sure if we should do this. Originally, in mysql-4.1 times, I intentionally made this to reduce table sizes.
<cut>
+static int +my_coll_parser_too_long_error(MY_COLL_RULE_PARSER *p, const char *name) +{ + my_snprintf(p->errstr, sizeof(p->errstr), "%s is too long", name); + return 0; +}
You should limit '%s' to 64 characters so that one gets the name in the output even if it's over sizeof()...
%-.64s ....
The list of possible values that are passed to this functions is: "Contraction" "Expansion" "Context" "logical position" No needs for %-.64s.
<cut>
Regards, Monty