-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Monty - Thu, 11 Feb 2010, 19:59)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.17962    2010-02-11 19:59:45.000000000 +0200
+++ /tmp/wklog.34.new.17962    2010-02-11 19:59:45.000000000 +0200
@@ -1,13 +1,8 @@
-
-<contents>
 1. GPB Encoding overview
 2. GPB in an SQL database
-2.1 Informing server about GPB field names and types
-2.2 Addressing GPB fields
-2.2.1 Option1: SQL Function
-2.2.2 Option2: SQL columns
-</contents>
-
+3. Encoding to use for dynamic columns
+4. How to store and access data in a protocol buffer from SQL
+5. Extensions for the future

 1. GPB Encoding overview
 ========================
@@ -37,42 +32,50 @@
 traffic right away, and will open path to getting the best possible
 performance.

-2.1 Informing server about GPB field names and types
------------------------------------------------------
-User-friendly/meaningful access to GPB fields requires knowledge of GPB field
-names and types, which are not available from GPB message itself (see "GPB
-encoding overview" section).
-
-So the first issue to be addressed is to get the server to know the definition
-of stored messages. We intend to assume that all records have GPB messages
-that conform to a certain single definition, which gives one definition per
-GPB field.
+3. Encoding to use for dynamic columns
+======================================
-DecisionToMake: How to pass the server the GPB definition?
-First idea: add a CREATE TABLE parameter which will specify either the
-definition itself or path to .proto file with the definition.
+The data should be coded into the proto buffer in the following format:
+
+<field_number><value_type><value>[<field_number><value_type><value>...]
+
+Where field_number is a number between 0-65536 that identifes the field
+<value_type> is a enum of type 'Item_result'
+<value> is the value coded in proto format.
+
+In other words, we should have no nested or complex structure.
+
+4. How to store and access data in a protocol buffer from SQL
+============================================================
+
+User-friendly/meaningful access to GPB fields requires knowledge of
+GPB field names and types, which are not available from GPB message
+itself (see "GPB encoding overview" section).
+
+To make things easy for the user, we will at first stage provide SQL
+functions to manipulate a string that is actually in proto format.
-2.2 Addressing GPB fields
--------------------------
-We'll need to provide a way to access GPB fields. This can be complicated as
-structures that are encoded in GPB message can be nested and recursive.
-
-2.2.1 Option1: SQL Function
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Introduce an SQL function GPB_FIELD(path) which will return contents of the
-field.
-- Return type of the function will be determined from GPB message definition.
-- For path, we can use XPath selector (a subset of XPath) syntax.
-
-(TODO ^ the above needs to be specified in more detail. is the selector as
-simple as filesystem path or we allow quantifiers (with predicates?)?)
-
-2.2.2 Option2: SQL columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-Make GPB columns to be accessible as SQL columns.
-This approach has problems:
-- It might be hard to implement code-wise
-  - (TODO will Virtual columns patch help??)
-- It is not clear how to access fields from nested structures. Should we allow
-  quoted names like `foo/bar[2]/baz' ?
+The functions we should provde are:
+proto_get(gpb, field_number, type)
+
+This return the field tagged with 'field_number' from the 'gpb' buffer.
+
+Example: proto_get(blob, 1, varchar) -> Returns field number 1 as varchar
+
+proto_put(gpb, field_number, value)
+
+This returns a new gbp buffer with the new value appended.
+
+Example: proto_put(proto_put(blob, 1, 1), 2, "hello")
+
+5. Extension for future
+=======================
+
+In the future we may want to access data based on name and get MariaDB to
+automaticly know the correct type. To do this we need to be able to
+store a definition for the content of the proto buffer somewhere.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify the
+definition itself.

-=-=(Monty - Thu, 11 Feb 2010, 19:59)=-=-
High Level Description modified.
--- /tmp/wklog.34.old.17915    2010-02-11 17:59:17.000000000 +0000
+++ /tmp/wklog.34.new.17915    2010-02-11 17:59:17.000000000 +0000
@@ -1,5 +1,21 @@
-Add support for Google Protocol Buffers (further GPB). It should be possible
-to have columns that store GPB-encoded data, as well as use SQL constructs to
+Add support for dynamic columns:
+
+- A column that can hold information from many columns
+- One can instantly add or remove column data
+
+This is a useful feature for any store type of application, where you want to
+store different type of information for different kind of items.
+
+For example, for shoes you want to store: material, size, colour, maker
+For a computer you want to store ram, hard disk size etc...
+
+In a normal 'relational' system you would need to a table for each type.
+With dynamic columns you have all common items as fixed fields (like
+product_code, manufacturer, price) and the rest stored in a dynamic column.
+
+The proposed idea is to store the dynamic information in a blob in
+Google Protocol Buffers (further GPB) format and use SQL constructs to
 extract parts of GPB data for use in select list, for filtering, and so forth.
+
 Any support for indexing GPB data is outside of scope of this WL entry.

-=-=(Knielsen - Fri, 22 Jan 2010, 11:38)=-=-
Low Level Design modified.
--- /tmp/wklog.34.old.29965    2010-01-22 11:38:57.000000000 +0200
+++ /tmp/wklog.34.new.29965    2010-01-22 11:38:57.000000000 +0200
@@ -2,3 +2,12 @@
 and a parser for text form of .proto file which then exposes the parsed
 file via standard GPB message navigation API.
+* We should have both server-side support and client-side support (client side
+  means functions in libmysqlclient so that user can select the full BLOB and
+  extract fields in the application).
+
+* Add some kind of header to the GPB blob to support versioning and future
+  extensibility.
+
+* Add complete syntax description (update, add, drop, exists, ...).
+

-=-=(Psergey - Tue, 21 Jul 2009, 21:13)=-=-
Low Level Design modified.
--- /tmp/wklog.34.old.6462     2009-07-21 21:13:13.000000000 +0300
+++ /tmp/wklog.34.new.6462     2009-07-21 21:13:13.000000000 +0300
@@ -1 +1,4 @@
+* GPB tarball contains a protocol definition for .proto file structure itself
+  and a parser for text form of .proto file which then exposes the parsed
+  file via standard GPB message navigation API.

-=-=(Psergey - Tue, 21 Jul 2009, 21:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.6399     2009-07-21 21:12:23.000000000 +0300
+++ /tmp/wklog.34.new.6399     2009-07-21 21:12:23.000000000 +0300
@@ -1 +1,78 @@
+<contents>
+1. GPB Encoding overview
+2. GPB in an SQL database
+2.1 Informing server about GPB field names and types
+2.2 Addressing GPB fields
+2.2.1 Option1: SQL Function
+2.2.2 Option2: SQL columns
+</contents>
+
+
+1. GPB Encoding overview
+========================
+
+GBB is a compact encoding for structured and typed data. A unit of GPB data
+(it is called message) is only partially self-describing: it's possible to
+iterate over its parts, but, quoting the spec
+
+http://code.google.com/apis/protocolbuffers/docs/encoding.html:
+ " the name and declared type for each field can only be determined on the
+ decoding end by referencing the message type's definition (i.e. the .proto
+ file). "
+
+2. GPB in an SQL database
+=========================
+
+It is possible to store GPB data in MariaDB today - one can declare a binary
+blob column and use it to store GPB messages. Storing and retrieving entire
+messages will be the only available operations, though, as the server has no
+idea about the GPB format.
+It is apparent that ability to peek inside GPB data from SQL layer would be of
+great advantage: one would be able to
+- select only certain fields or parts of GPB messages
+- filter records based on the values of GPB fields
+- etc
+performing such operations at SQL layer will allow to reduce client<->server
+traffic right away, and will open path to getting the best possible
+performance.
+
+2.1 Informing server about GPB field names and types
+----------------------------------------------------
+User-friendly/meaningful access to GPB fields requires knowledge of GPB field
+names and types, which are not available from GPB message itself (see "GPB
+encoding overview" section).
+
+So the first issue to be addressed is to get the server to know the definition
+of stored messages. We intend to assume that all records have GPB messages
+that conform to a certain single definition, which gives one definition per
+GPB field.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify either the
+definition itself or path to .proto file with the definition.
+
+2.2 Addressing GPB fields
+-------------------------
+We'll need to provide a way to access GPB fields. This can be complicated as
+structures that are encoded in GPB message can be nested and recursive.
+
+2.2.1 Option1: SQL Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Introduce an SQL function GPB_FIELD(path) which will return contents of the
+field.
+- Return type of the function will be determined from GPB message definition.
+- For path, we can use XPath selector (a subset of XPath) syntax.
+
+(TODO ^ the above needs to be specified in more detail. is the selector as
+simple as filesystem path or we allow quantifiers (with predicates?)?)
+
+2.2.2 Option2: SQL columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make GPB columns to be accessible as SQL columns.
+This approach has problems:
+- It might be hard to implement code-wise
+  - (TODO will Virtual columns patch help??)
+- It is not clear how to access fields from nested structures. Should we allow
+  quoted names like `foo/bar[2]/baz' ?
+


DESCRIPTION:

Add support for dynamic columns:

- A column that can hold information from many columns
- One can instantly add or remove column data

This is a useful feature for any store-type application where you want to
store different types of information for different kinds of items.

For example, for shoes you want to store: material, size, colour, maker.
For a computer you want to store: RAM, hard disk size, etc.

In a normal 'relational' system you would need a table for each item type.
With dynamic columns you keep all common attributes as fixed fields (like
product_code, manufacturer, price) and store the rest in a dynamic column.

The proposed idea is to store the dynamic information in a blob in
Google Protocol Buffers (further GPB) format and use SQL constructs to
extract parts of the GPB data for use in the select list, for filtering,
and so forth.

Any support for indexing GPB data is outside the scope of this WL entry.


HIGH-LEVEL SPECIFICATION:

1. GPB Encoding overview
2. GPB in an SQL database
3. Encoding to use for dynamic columns
4. How to store and access data in a protocol buffer from SQL
5. Extensions for the future

1. GPB Encoding overview
========================

GPB is a compact encoding for structured and typed data. A unit of GPB data
(called a message) is only partially self-describing: it is possible to
iterate over its parts, but, quoting the spec
http://code.google.com/apis/protocolbuffers/docs/encoding.html:

  "the name and declared type for each field can only be determined on the
   decoding end by referencing the message type's definition (i.e. the .proto
   file)."

2. GPB in an SQL database
=========================

It is possible to store GPB data in MariaDB today: one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages will be the only available operations, though, as the server has no
idea about the GPB format.

It is apparent that the ability to peek inside GPB data from the SQL layer
would be of great advantage: one would be able to
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc.

Performing such operations at the SQL layer reduces client<->server traffic
right away and opens the path to the best possible performance.

3. Encoding to use for dynamic columns
======================================

The data should be coded into the proto buffer in the following format:

<field_number><value_type><value>[<field_number><value_type><value>...]

where
- <field_number> is a number between 0 and 65536 that identifies the field,
- <value_type> is an enum of type 'Item_result',
- <value> is the value coded in proto format.

In other words, we should have no nested or complex structure.

4. How to store and access data in a protocol buffer from SQL
==============================================================

User-friendly/meaningful access to GPB fields requires knowledge of
GPB field names and types, which are not available from the GPB message
itself (see the "GPB encoding overview" section).

To make things easy for the user, we will at the first stage provide SQL
functions to manipulate a string that is actually in proto format.

The functions we should provide are:

proto_get(gpb, field_number, type)

This returns the field tagged with 'field_number' from the 'gpb' buffer.

Example: proto_get(blob, 1, varchar) -> returns field number 1 as varchar

proto_put(gpb, field_number, value)

This returns a new gpb buffer with the new value appended.

Example: proto_put(proto_put(blob, 1, 1), 2, "hello")
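
As an illustration of how these two functions could work together with the
dynamic-column idea from the description, here is a hypothetical usage
sketch. The 'product' table and its columns, the 'integer' type keyword, and
the use of an empty string as the initial buffer for proto_put() are all
assumptions made up for this example; the functions do not exist yet and the
final syntax may differ.

  -- Fixed fields common to every product, plus one blob for the per-type data
  CREATE TABLE product (
    product_code VARCHAR(20) PRIMARY KEY,
    manufacturer VARCHAR(40),
    price        DECIMAL(10,2),
    dyn_cols     BLOB
  );

  -- Store shoe-specific data: field 1 = size, field 2 = colour.
  -- The numbering is chosen by the application, since fields are addressed
  -- by number rather than by name (see section 5 for name-based access).
  INSERT INTO product VALUES
    ('SH-001', 'ACME', 29.90,
     proto_put(proto_put('', 1, 42), 2, 'black'));

  -- Read one dynamic field in the select list and filter on another
  SELECT product_code,
         proto_get(dyn_cols, 1, integer) AS size
  FROM product
  WHERE proto_get(dyn_cols, 2, varchar) = 'black';

Because proto_put() returns a new buffer rather than modifying one in place,
adding a field is just a nested call, and an UPDATE would follow the same
pattern (e.g. SET dyn_cols = proto_put(dyn_cols, 3, 'leather')).
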
5. Extensions for the future
============================

In the future we may want to access data based on name and have MariaDB
automatically know the correct type. To do this we need to be able to
store a definition for the content of the proto buffer somewhere.

DecisionToMake: How to pass the server the GPB definition?
First idea: add a CREATE TABLE parameter which will specify the
definition itself.


LOW-LEVEL DESIGN:

* The GPB tarball contains a protocol definition for the .proto file structure
  itself, and a parser for the text form of a .proto file which then exposes
  the parsed file via the standard GPB message navigation API.

* We should have both server-side support and client-side support (client side
  means functions in libmysqlclient so that the user can select the full BLOB
  and extract fields in the application).

* Add some kind of header to the GPB blob to support versioning and future
  extensibility.

* Add a complete syntax description (update, add, drop, exists, ...).


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)