
Having designed the database template library (a C++ STL-type interface for ODBC at http://dtemplatelib.sourceforge.net/) I thought I'd throw in a few cents about some design you'll want to think through *very* carefully. Here are some key questions to ask:

Iterator design:
ODBC supports multiple cursor types, which I think are pretty typical of various kinds of databases. You want to think carefully about how you map these to STL iterator types. Let's say you do a simple mapping of forward cursor -> input iterator and scrollable cursor -> random access iterator. Immediately there are "gotchas".

input iterator:
- What happens when you copy your iterator? Should it point to the same cursor? If it does, then making a copy of the iterator and incrementing will have a "side effect" on the original iterator. If it does not, then users may be surprised when they compare two copies of the same iterator and don't get the same thing. How can two cursors be equal to begin() of a container but not have the same value?

random access iterator:
- I assume your iterators will all point to the same recordset resource - but if so, how do you avoid "thrashing"? If a and b are two iterators that point to the same scrollable recordset at different positions, then you don't really want to be executing multiple database positioning operations to go back and forth between the values. This implies some kind of intelligent caching, but you want to be careful or you'll end up with more than you can handle in memory, since it is not uncommon for tables to have a million rows or more (O.K., maybe the Human Genome center is a bit larger than most in this regard :-) ).
- You'll probably also want to implement size() operations intelligently; paging through every record and counting is not so good...
- Be aware of the performance cost of scrollable recordsets (see below for more).

We also had an input + output iterator that we wrote to allow users to loop through a recordset and modify values.
Binding design:
- In DTL we wanted to get away from the MFC model of forcing you to inject our code into your classes. Instead we defined operators that specify the mapping relationship between columns in your query and members in your class. I think this is a good model since it keeps the database mappings from getting tangled up with the class logic. This kind of design is also used very successfully in the Java world by popular projects like Hibernate, which let you define your class as you need and simply ask you to provide XML type mappings. I'm also not a big fan of frameworks that force you into a particular class design.
Here is a link from DTL showing this: http://dtemplatelib.sourceforge.net/dtl_introduction.htm
Here is an example of a Hibernate mapping: http://www.hibernate.org/hib_docs/v3/reference/en/html/quickstart.html#quick...

Transparent SQL:
- One of the design goals in DTL was to try not to be "smart" about SQL. The fact is that folks do all kinds of strange things in SQL: different outer join syntax, hierarchical queries, analytic functions over join partitions, stored procedures etc. I think there is a real advantage to having an iterator that can "get out of the way" and NOT try to parse what the user is doing, so they can execute these kinds of complex queries. Also, SQL is well understood by most developers, so letting them use it transparently rather than creating your own join language will significantly reduce the learning curve.
- Having said this, being able to construct select, insert and update statements for the user has advantages - it lets you create read/write containers from a set of bindings transparently and can let you work around limitations of the underlying drivers.

Binding strings:
- In ODBC, at least, this is rather ugly. You can choose to either bind a fixed-size buffer (implying a maximum string size) or you can use something called GetData and PutData to transfer an arbitrary-length string in blocks.
If you choose the latter (which is what we did, so folks could bind std::string and not have to recompile every time their database field size changed), ODBC gets annoying in that many drivers then require that such string columns come last.

Think carefully about how you want to support stored procedures:
1. A stored procedure can have input parameters, output parameters and can return multiple sets of results (and will in SQL Server).

How to support user-defined types?
1. Users will want to define custom types that can be mapped transparently to sets of database fields. Examples:
- A complex data type that maps to a pair of fields.
- A "date" type that is stored as an integer (similar to boost::time) but might be written to multiple database fields.
- Mapping of more complicated types, graphs, sounds etc. to blobs.

Transactions:
In ODBC, at least, there is not a database-independent notion of a transaction. Instead, when you commit, you commit everything open against that connection. So if you want multiple transactions you need multiple connections. We went with C++ try...catch logic to handle transactions since this closely mirrors some of the ACID concepts you need for transactions in a database. So we would have

  try {
    // a set of operations here
    commit;
  } catch(...) {
    rollback;
  }

More on our design for this is here: http://dtemplatelib.sourceforge.net/ExceptionHandling.htm

More later... anyway I would be interested in helping on this project but don't know how much time I will have to commit to it. I definitely like the idea of defining the base layer in terms of simple operations so that it can support both ODBC and native drivers, since I think this would be a big advantage for situations where an ODBC driver simply is not available ...

Here is some background on how the ODBC cursor types compare to the C++ iterator types and a little bit on how the mapping was done in DTL.

ODBC cursors.
The process of retrieving rows from a query result set and returning them to the application is called fetching. An application fetches data with a cursor. An ODBC cursor on a result set indicates the current position in the result set and what row of data will be returned next. In ODBC there are two main cursor types.

Forward Only Cursors:
- Can only move forward through the set of records.
- Are not guaranteed to move through the records in any particular order.
- Two different cursors against the same table using the same query may not return the same rows or the same number of rows, due to inserts, updates and deletes.

Scrollable Cursors:
- Can move forward and backward through the result set.
- Two different cursors against the same table using the same query may not return the same rows or the same number of rows, due to inserts, updates and deletes.

Scrollable cursors raise two additional questions not seen in forward-only cursors. First, how does the cursor establish some kind of order for the set of records returned? Without some kind of order for the records, moving "forward" or "backward" through the result set does not have any meaning. Second, should the cursor detect changes to rows that have been previously fetched? In other words, when we re-position the cursor should it detect changes due to inserted, updated or deleted rows? Depending on how these questions are answered one obtains different scrollable cursor types.

Scrollable Cursors:
- Static cursor type.
  o Once the cursor has been created, the set of records is fixed regardless of any subsequent changes made to the database. Any inserts, updates or deletes by other applications to the set of records are not detected by this cursor. Usually this is implemented either by locking the underlying table so that no changes can be made by other applications while the cursor is in effect, or by viewing the set of records as of a particular version.
- Dynamic cursor type.
  o Detects any changes made to the set of records due to inserts, updates or deletes by other applications. Usually this kind of cursor is implemented by simply ordering the records with a unique index. The cursor then re-executes a fetch every time it scrolls to a new set of rows. Suppose the original set of records is defined by "SELECT * FROM EMPLOYEES ORDER BY EMPLOYEE_ID". Then, when the user requests that this cursor scroll to the next set of rows, the cursor would execute the query "SELECT * FROM EMPLOYEES WHERE EMPLOYEE_ID > (?) ORDER BY EMPLOYEE_ID", passing as a parameter the EMPLOYEE_ID from the last row in the current buffer. This kind of cursor actually creates multiple result sets due to the re-execution of the query every time a scroll operation is performed. As a result, the cursor does not establish a consistent order for the records, in the sense that the "second" record in the result set might change its value and position if we scroll forward and then attempt to scroll back to this "second" record.
- Keyset driven cursor type.
  o When the cursor is opened, it immediately stores a set of unique keys for all rows in the result set. Later, when the user attempts a scroll operation, the cursor simply retrieves rows corresponding to the previously stored keys. This means that updates by other applications are detected with a new value for the row (unless the update changes the key - if another application changes the key this is equivalent to a delete followed by an insert as far as a keyset cursor is concerned). Deleted records show up as "holes" with a special status flag. As for inserts made by other applications, these are not detected, since the inserted rows did not appear in the original set of keys that was fetched when the cursor was first opened.
This cursor does establish a consistent order for the records, in the sense that if we scroll forwards and backwards the same records will appear in the same position, although they may be marked as having been deleted and may contain updated values.
- Mixed cursor.
  o This is a combination of a keyset driven cursor and a dynamic cursor. This kind of cursor is typically used when the table is too large to make it practical to save keys for all the rows in the result set. In this case, the cursor fetches keys for some subset of the rows. Inside this subset, the cursor behaves like a keyset driven cursor. Outside this subset, the cursor behaves like a dynamic cursor. A mixed cursor is equivalent to a keyset-driven cursor when the size of the key subset is equal to the result set size, and equivalent to a dynamic cursor when the size of the key subset is equal to 1.

STL Iterators.

Here we summarize some key properties of certain C++ standard iterators. For this discussion we are really interested in only three types of iterators: input iterators, output iterators and random access iterators. Input and output iterators are considered primarily because these have the smallest set of requirements of all the iterator types and so can be mapped most easily to an ODBC cursor. Random access iterators are interesting since these have the largest set of requirements and correspond most closely to the type of usage that is made of scrollable ODBC cursors. Here we summarize only the iterator properties that are important to our discussion (for full details on the iterator requirements see section 24.1 of the C++ standard).

Noteworthy Input Iterator Requirements
1. Input iterators are assignable and copy constructible.
2. Let a, b be input iterators. If a==b and (a,b) is in the domain of == then *a is equivalent to *b.
3. Let r be an input iterator.
If we increment r via ++r, then after the increment any copies of the previous value of r are no longer required to be dereferenceable or in the domain of ==.
4. For input iterators, a==b does not imply ++a==++b. Algorithms on input iterators should never attempt to pass through the same iterator twice; they should be single-pass algorithms.
5. Let r be an input iterator. The result of *r must be convertible to the value type of the iterator.

Some comments.
- Requirement two, that the dereferenced values of a and b be 'equivalent', is vague. The standard is not clear on what 'equivalent' means, but it does not mean the two must have the same value.
- Requirement three means that, conceptually, when we implement an input iterator we are allowed to code it such that only one copy of the iterator can be "active". The reason for this is that once one copy of the iterator has been incremented, other copies become invalid.

Noteworthy Output Iterator Requirements
1. Output iterators are assignable and copy constructible.
2. Let r be an output iterator. The only valid use of operator * is on the left hand side of an assignment. After an assignment, the iterator, r, must be incremented before it can be assigned to again. Algorithms on output iterators should be single-pass algorithms.

Some comments.
- Typically operator * on an output iterator is defined to return a proxy class that performs a write of the data upon assignment. This proxy might be the iterator itself or might be some other class.

How DTL Abstracts ODBC cursors as iterators.

SELECT statement, forward only cursor: Input Iterator
- In order to allow multiple copies of the iterator to be active at the same time, we chose a design wherein each iterator holds its own ODBC cursor. This allows us to have logic like a=b; ++a; ++b; and have both iterators a and b be able to return a value. In fact, we did not have to do this, since the standard allows only one copy of the iterator to be active.
So, under the standard it would be allowable for b to become invalid once a had been incremented.
- Note that we are taking full advantage of the standard in performing this mapping. Records are not guaranteed to come back in any particular order, nor are two copies of an iterator guaranteed to retrieve the same set of records, due to the fact that inserts, updates and deletes may be performed on the set of records the select statement is operating against.

INSERT, UPDATE, DELETE statement, forward only cursor: Output Iterator
- This is fairly straightforward. The only trick here is that the execute of the statement must happen on assignment. Otherwise, loops that use post-increment logic to write to the iterator will not behave correctly.

STORED PROCEDURE, forward only cursor: Input + Output Iterator
- Mapping stored procedures to C++ iterators was less straightforward since stored procedures perform both input and output.
- Stored procedures accept arguments as input, and in this way act as an output iterator.
- Stored procedures can return results from a query and can return results via their arguments, so in this way they act as an input iterator. What is more, stored procedures can return more than one result set, which means the resulting iterator may need to act over more than one range. Therefore, we added a MoreResults() method to the sql_iterator object so that the iterator can pass through multiple result sets that might be returned by the query.
- To split the input and output for stored procedures we have sql_iterators hold two buffers. One buffer we call the data object (DataObj), which stores the result sets that come back from the query. The second buffer we call the parameter object (ParamObj), which is designed to hold parameters that are passed to the query.