Re: [boost] Boost.SQL, SOCI and beyond

From: Steve Hutton <shutton@featurecomplete.com>
I think it would be good practice to keep both queries and their related TypeConversions in the same header file. But of course even this isn't enough to ensure queries don't break - the database itself can always change :-) But using select * is another precaution that gives some added protection from database changes.
Surely you mean "... NOT using select *..."? I've always considered it a best practice not to use select * in any production code. Is there another perspective on this?

- James Jones, Webmaster / Data Architect
Administrative Data Mgmt.
375 Raritan Center Pkwy, Suite A
Edison, NJ 08837

On 10/6/06, james.jones@firstinvestors.com <james.jones@firstinvestors.com> wrote:
Surely you mean "... NOT using select *..."? I've always considered it a best practice not to use select * in any production code. Is there another perspective on this?
It is my understanding that "select *" is generally less efficient than specifying the fields to select. Not to mention that if you are relying on select * to return fields in a specific order, and that order changes, you're in for some hard to find bugs. Jeremy
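Jeremy's point about field order can be made concrete with a small sketch. The types and field names below are invented for illustration, not SOCI's API: a positional lookup silently returns the wrong field once the column order changes, while a name-based lookup either keeps working or fails loudly.

```cpp
#include <map>
#include <string>
#include <vector>

// A row as an ordered list of values (positional) vs. keyed by column name.
using PositionalRow = std::vector<std::string>;
using NamedRow = std::map<std::string, std::string>;

// Fragile: hard-codes the assumption that FIRST_NAME is column 0.
std::string first_name_by_position(const PositionalRow& row) {
    return row.at(0);
}

// Robust: looks the field up by name, regardless of column order.
std::string first_name_by_name(const NamedRow& row) {
    return row.at("FIRST_NAME");
}
```

If the query starts returning LAST_NAME first, `first_name_by_position` quietly returns the last name, which is exactly the hard-to-find bug described above; `first_name_by_name` is unaffected.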

Jeremy Day wrote:
On 10/6/06, james.jones@firstinvestors.com <james.jones@firstinvestors.com> wrote:
Surely you mean "... NOT using select *..."? I've always considered it a best practice not to use select * in any production code. Is there another perspective on this?
It is my understanding that "select *" is generally less efficient than specifying the fields to select. Not to mention that if you are relying on select * to return fields in a specific order, and that order changes, you're in for some hard to find bugs.
I think that the choice depends on what you are doing. In an object-relational mapping with a static, typesafe representation I would explicitly list all the fields in my query, so that any mismatch between the current structure of the database and my code will cause my queries to fail immediately. On the other hand, if I'm using a dynamic representation that only relies on a limited subset of the fields involved in the query, I'd use '*' so as to limit the conditions that could break my code. In such cases I wouldn't rely on field order but rather on matching by name.

Cheers,
Nicola Musatti
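Nicola's dynamic scenario can be sketched as follows. The `Row` type and field names here are invented stand-ins, not SOCI's actual interface: the code consumes only the two fields it needs out of whatever "select *" returned, matched by name, so extra columns or a change in column order leave it unaffected.

```cpp
#include <map>
#include <string>

// Hypothetical dynamic row: all column values keyed by name.
using Row = std::map<std::string, std::string>;

// Relies only on FIRST_NAME and LAST_NAME; any other columns returned by
// "select *" are simply ignored, so adding or reordering columns in the
// table cannot break this function. Only renaming or dropping one of the
// two used columns would.
std::string display_name(const Row& row) {
    return row.at("FIRST_NAME") + " " + row.at("LAST_NAME");
}
```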

Nicola Musatti wrote:
On the other hand if I'm using a dynamic representation that only relies on a limited subset of the fields involved in the query I'd use '*' so as to limit the conditions that could break my code. In similar cases I wouldn't rely on field order but rather on matching by name.
Why not simply select the limited subset on which you rely?

Peter Dimov wrote:
Nicola Musatti wrote:
On the other hand if I'm using a dynamic representation that only relies on a limited subset of the fields involved in the query I'd use '*' so as to limit the conditions that could break my code. In similar cases I wouldn't rely on field order but rather on matching by name.
Why not simply select the limited subset on which you rely?
You're right. Actually the only case where it's meaningful to use 'select *' is when you don't know which columns you're looking for or you don't care. That is, you're either exploring the data or writing a context independent tool. Cheers, Nicola Musatti

Nicola Musatti wrote:
Peter Dimov wrote:
Nicola Musatti wrote:
On the other hand if I'm using a dynamic representation that only relies on a limited subset of the fields involved in the query I'd use '*' so as to limit the conditions that could break my code. In similar cases I wouldn't rely on field order but rather on matching by name.
Why not simply select the limited subset on which you rely?
You're right. Actually the only case where it's meaningful to use 'select *' is when you don't know which columns you're looking for or you don't care. That is, you're either exploring the data or writing a context independent tool.
Even so, it is better to extract the schema first, to guard against selecting large BLOBs.
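Peter's suggestion could look roughly like this. The `ColumnInfo` record stands in for whatever metadata the driver's schema query actually returns (the names here are invented): fetch the schema first, then build an explicit column list that skips BLOB columns instead of issuing a blind "select *".

```cpp
#include <string>
#include <vector>

// Minimal stand-in for a column description obtained from the schema.
struct ColumnInfo {
    std::string name;
    std::string type;  // e.g. "INTEGER", "VARCHAR", "BLOB"
};

// Builds an explicit select statement from the schema, guarding against
// accidentally dragging large BLOBs over the wire.
std::string build_select(const std::string& table,
                         const std::vector<ColumnInfo>& schema) {
    std::string cols;
    for (const ColumnInfo& c : schema) {
        if (c.type == "BLOB")
            continue;  // skip large binary columns
        if (!cols.empty())
            cols += ", ";
        cols += c.name;
    }
    return "select " + cols + " from " + table;
}
```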

On 2006-10-07, Nicola Musatti <Nicola.Musatti@gmail.com> wrote:
Peter Dimov wrote:
Nicola Musatti wrote:
On the other hand if I'm using a dynamic representation that only relies on a limited subset of the fields involved in the query I'd use '*' so as to limit the conditions that could break my code. In similar cases I wouldn't rely on field order but rather on matching by name.
Why not simply select the limited subset on which you rely?
You're right. Actually the only case where it's meaningful to use 'select *' is when you don't know which columns you're looking for or you don't care. That is, you're either exploring the data or writing a context independent tool.
Well, back to Jeff's point about how it would be nice if the specification of the column names in the query was somehow validated against the column names in the o/r mapping at compile time, to find any disconnects. It's an interesting idea (might require a full SQL parser to achieve?) What I was trying to point out is that the use of select * and name based mapping is one way to eliminate the possibility of such disconnects.

Steve

Steve Hutton wrote:
On 2006-10-07, Nicola Musatti <Nicola.Musatti@gmail.com> wrote:
Peter Dimov wrote:
Nicola Musatti wrote:
On the other hand if I'm using a dynamic representation that only relies on a limited subset of the fields involved in the query I'd use '*' so as to limit the conditions that could break my code. In similar cases I wouldn't rely on field order but rather on matching by name.

Why not simply select the limited subset on which you rely?

You're right. Actually the only case where it's meaningful to use 'select *' is when you don't know which columns you're looking for or you don't care. That is, you're either exploring the data or writing a context independent tool.
Well, back to Jeff's point about how it would be nice if the specification of the column names in the query was somehow validated against the column names in the o/r mapping at compile time, to find any disconnects.
Well, actually I wasn't looking for direct validation against the database, but locality of mapping data. A tool that does this sort of validation would make a good library example, though.

Anyway, if I'm maintaining a type that is database enabled there are several types of possible changes:

1) add/remove a field/data member
2) change a field/data member type
3) change a field/data member name

Now there's an infinite number of solutions to the mapping problem. Many approaches maintain the member/mapping information outside of the code in configuration. That's fine, but personally I usually prefer to have a way to maintain this information directly in the code -- the main reason being that most of the changes above require modification of the code. So when I make any of these modifications, ideally I want to go to one file, make the change, recompile, update the schema if needed, and move on. If there's another file, that's a non-local change that someone has to go find and make.

Currently I think SOCI is flexible enough to allow the mapping either way. But I think it would be nice to explore how the code centric approach would really work -- hence my prior comments. Following up from the prior example, there's a couple of ways the mapping could be maintained in the code. For example, one immediate issue with the O/R mapping is that the mapping has to be repeated. Here's the snippet:

    template<>
    struct TypeConversion<Person> {
        static Person from(Values const &v) {
            Person p;
            p.id = v.get<int>("ID");
            p.firstName = v.get<std::string>("FIRST_NAME");
            p.lastName = v.get<std::string>("LAST_NAME");
            //...
            return p;
        }
        static Values to(Person &p) {
            Values v;
            v.set("ID", p.id);
            v.set("FIRST_NAME", p.firstName);
            v.set("LAST_NAME", p.lastName);
            //...
            return v;
        }
    };

So changes of type #1 and #3 require up to 3 changes: 2 in the above mapping code, and 1 to the query. If "select *" is used in the query then the third change can be avoided.
But I think the interesting question we should be asking is: what if we create a registration system that can simplify and clarify the mapping?

    template<>
    struct TypeRegistration<Person> {
        static std::string base_query() {
            return "select FIRST_NAME, LAST_NAME, ID from PersonTable";
        }
        static void register_type(std::multimap<std::string, std::string>& registry) {
            // pseudo code here...
            registry.insert(std::make_pair("Person", "PersonTable.FIRST_NAME"));
            registry.insert(std::make_pair("Person", "PersonTable.LAST_NAME"));
            //...
        }
    };

Now obviously this loses the ability to write a function to do field specific conversions. This might be overcome by making the registry information more sophisticated. It still suffers from requiring 2 updates for several of the scenarios, but they are local. A more sophisticated approach could actually build the base_query. Anyway, I think this is all a layer on the SOCI core...
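The "more sophisticated approach could actually build the base_query" idea can be sketched in a few lines. This is a hypothetical design, not SOCI code: the table name and column list live in exactly one registration object, and the query string is derived from them, so a field rename touches a single list.

```cpp
#include <string>
#include <vector>

// Hypothetical per-type registration: one place holding the table name and
// the mapped columns, from which the base query is generated rather than
// written out by hand a second time.
struct TypeRegistration {
    std::string table;
    std::vector<std::string> columns;

    std::string base_query() const {
        std::string cols;
        for (const std::string& c : columns) {
            if (!cols.empty())
                cols += ", ";
            cols += c;
        }
        return "select " + cols + " from " + table;
    }
};
```

With this layout, change #3 (renaming a field) becomes a single edit to the `columns` list; the query can never drift out of sync with the mapping because it is built from it.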
It's an interesting idea (might require full SQL parser to achieve?)
I think this is overkill.
What I was trying to point out is that the use of select * and name based mapping is one way to eliminate the possibility of such disconnects.
Like many design decisions, it has some advantages and some disadvantages. Jeff

Jeff Garland <jeff <at> crystalclearsoftware.com> writes: [...]
Now there's an infinite number of solutions to the mapping problem. Many approaches maintain the member/mapping information outside of the code in configuration. That's fine, but personally, I usually prefer to have a way to maintain this information directly in the code -- the main reason being that most of the changes above require modification of the code. So when I make any of these modifications, ideally I want to go to one file, make the change, recompile, update the schema if needed, and move on. If there's another file, that's a non-local change that someone has to go find and make.
There are two places where that information needs to be represented: in the database schema and, possibly implicitly, in the part of the application logic that uses that information. The only way I see not to have to maintain a third set of information, i.e. the explicit connection between db schema and C++ data types is to generate it automatically. [Example of the current SOCI O/R mapping and an hypothetical alternative mapping registry snipped]
Now obviously this loses the ability to write a function to do field specific conversions. This might be overcome by making the registry information more sophisticated. It still suffers from requiring 2 updates for several of the scenarios, but they are local. A more sophisticated approach could actually build the base_query.
The problem I see is that while you may achieve locality of information, the number of information items you may need to store for each field/attribute mapping still makes maintenance by hand inconvenient as soon as your db schema grows beyond a handful of tables.
It's an interesting idea (might require full SQL parser to achieve?)
I think this is overkill.
This depends on the complexity of the application, but building the parsing infrastructure doesn't take much longer than a day or two with a dynamic language and a parser generator, and writing the generating function for every type of code item you may want to generate won't take much longer than writing, say, three or four such items by hand. From the fifth table onward it's all gains.

Cheers,
Nicola Musatti

Steve Hutton <shutton <at> featurecomplete.com> writes: [...]
Well, back to Jeff's point about how it would be nice if the specification of the column names in the query was somehow validated against the column names in the o/r mapping at compile time, to find any disconnects.
It's an interesting idea (might require full SQL parser to achieve?)
The way we do it is as follows: the representations of both classes and tables are explicit in code, as is the description of how data members or accessors are associated to fields. The code that performs queries is thus generic.

We separated the classes that are used for data exchange from the actual application logic, so that the data exchange classes may be automatically generated. The tools we use for database design & maintenance can output the db structure in SQL DDL, which we parse to generate the data exchange classes and those parts of the data exchange infrastructure that depend on the db structure. This could obviously be done by connecting to the db and querying its structure, but SQL source files are more convenient to store and handle with a version control tool. As our schema includes in excess of 250 tables, this results in tens of thousands of lines of code that are automatically maintained.

This obviously doesn't handle the case of the schema being altered in between builds, but as we spell out in full the involved field names in all queries, in general either this isn't a problem or it causes an error in the application.

Cheers,
Nicola Musatti
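The first stage of the pipeline Nicola describes, pulling the column names out of the DDL, could be sketched roughly as below. This is a toy extractor, nothing like the parser-generator approach he actually uses: `extract_columns` is an invented name, and the naive comma splitting would mishandle types such as DECIMAL(10,2) or constraint clauses.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Pulls the column names from a single, simple CREATE TABLE statement by
// taking the text between the outer parentheses, splitting on commas, and
// keeping the first token of each column definition. A real DDL grammar
// needs a proper parser; this only handles the simplest cases.
std::vector<std::string> extract_columns(const std::string& ddl) {
    std::vector<std::string> cols;
    std::string::size_type open = ddl.find('(');
    std::string::size_type close = ddl.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close <= open)
        return cols;
    std::string body = ddl.substr(open + 1, close - open - 1);
    std::stringstream ss(body);
    std::string def;
    while (std::getline(ss, def, ',')) {  // one column definition per comma
        std::istringstream ds(def);
        std::string name;
        if (ds >> name)                   // first token is the column name
            cols.push_back(name);
    }
    return cols;
}
```

From a list like this, the generator can emit the data-exchange struct and its mapping code, which is where the "tens of thousands of lines that are automatically maintained" come from.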

On 2006-10-06, Jeremy Day <jeremy.day@gmail.com> wrote:
On 10/6/06, james.jones@firstinvestors.com <james.jones@firstinvestors.com> wrote:
Surely you mean "... NOT using select *..."? I've always considered it a best practice not to use select * in any production code. Is there another perspective on this?
It is my understanding that "select *" is generally less efficient than specifying the fields to select. Not to mention that if you are relying on select * to return fields in a specific order, and that order changes, you're in for some hard to find bugs.
Yes, if you are using select *, you should be using name based column access instead of positional access. The combination of the two, however, can help create code which is resistant to future changes. It depends on what your goals are, of course. To bring things back on topic, I think it's a good thing for a db library to support name based binding (and SOCI does). The use of select * is one reason why it's valuable.

Steve
participants (6)
-
james.jones@firstinvestors.com
-
Jeff Garland
-
Jeremy Day
-
Nicola Musatti
-
Peter Dimov
-
Steve Hutton