Re: [boost] RTL

23 Oct 2005

      Arkadiy Vertleyb wrote
...
"Calum Grant" <calum@visula.org> wrote
...
...
Calum, is it possible to see the latest code of your example?
Here it is.
I see...
A couple of questions:
1) You modify counter all the time -- does RML allow to do 
this with the fields on which the table is sorted?
Yes - by using table.update_row() or table.update_where(), the indexes
are automatically updated.  That's the whole point of RML - it
uses/updates indexes implicitly.  If you decide to index a column, and
no code needs to change.
...
2) When you say :
select(m_locations,
    m_locations.x>=loc.x-m_cutoff &&
    m_locations.x<=loc.x+m_cutoff &&
    m_locations.y>=loc.y-m_cutoff &&
    m_locations.y<=loc.y+m_cutoff &&
    m_locations.z>=loc.z-m_cutoff &&
    m_locations.z<=loc.z+m_cutoff)
Do you use any kind of special index for this?
Yes - the TMP query analyser sees that m_locations.z is an indexed
column, and uses that index.  If on the other hand x or y were indexed,
that index would be used in preference to z.

If I had implemented user-defined queries, I could write

	select(m_locations,
	     	m_locations.x>=loc.x-m_cutoff &&
     		m_locations.x<=loc.x+m_cutoff &&
     		m_locations.y>=loc.y-m_cutoff &&
     		m_locations.y<=loc.y+m_cutoff &&
     		m_locations.z>=loc.z-m_cutoff &&
     		m_locations.z<=loc.z+m_cutoff && 
		query(test_distance(loc)) )

where test_distance would be some user-defined functor to check the
distance more precisely.  Even in this scenario, the TMP query analyser
would use m_locations.z as an index.  It's not magic, it just uses the
first index is sees.
...
In any case this seems to prove that RML does a good job when 
searching on multiple indexes -- you do quite a bit of that.  
Also, inserts/updates are probably very efficient.
Insertion carries the same overheads as Boost.MultiIndex.  Update is
possibly more efficient, since nodes can be relocated in a single index
- without needing to destroy/create a new indexed item.  This is one of
the inefficiencies of std::set or std::map - if you change the key you
need to allocate a new node.
...
But, as I said before, the overall efficiency of this example 
is based on the fact that this is a very smart, custom, 
hand-coded solution (see how you maintain the counter).  No 
library would be able to beat its performance using a 
higher-level abstraction (If RTL had a count() operator in 
addition to generic groupby(), we could be much smarter in 
maintaining it during updates).
The RML solution is actually slower because of the incremental updates
to the "neighbours" column.  Had they been calculated in one go, then
the query would I suspect run faster.

I would really like to be able to write

  for_each(
    limit( 
      sort( (descending(col<2, 0>()), col<0, Location::name>() ),
        group_by( col<0, Location::id_column>(),
         select( 
          (locations, locations), 
          col<1, Location::x_column>() <= col<0, Location::x_column>() +
m_cutoff &&
          col<1, Location::x_column>() >= col<0, Location::x_column>() -
m_cutoff &&
          col<1, Location::y_column>() <= col<0, Location::y_column>() +
m_cutoff &&
          col<1, Location::y_column>() >= col<0, Location::y_column>() -
m_cutoff &&
          col<1, Location::z_column>() <= col<0, Location::z_column>() +
m_cutoff &&
          col<1, Location::z_column>() >= col<0, Location::z_column>() -
m_cutoff &&
          query(test_distance()) 
         ), 
        ),
       ), 50),
    display_result()
  );

In order to achieve this, I would need to implement sorting using
multiple columns, group_by(), and query().  And this query would I
suspect run as fast as the original.  (Though it would not support
dynamic update).

So the fact that I needed to split my query up is nothing to do with the
fundamental design of RML.  It was because some of the functions were
not implemented.  I will work towards getting these extra functions
implemented, but I have a long todo-list and it won't happen
immediately.
...
But again, I don't believe this all proves RML is faster than 
RTL, since the coding approaches were so different.  I don't 
even know if it's possible to compare the libraries at this 
point, since they have their strong points in different 
areas.  RTL's strong point in the ability to build and run 
SQL-like views.  This I tried to show in my example.  RML 
seems to be very efficient at maintaining and using multiple 
indexes, but, despite its more SQL-like interface, can 
mimique only very simple SQL queries.
You're only saying that because I haven't got around to implementing
group_by and multi-column sorts.  These are less commonly used features
that were not a priority to implement.  But they would not be difficult.
When RML can implement the entire query in one statement, we can re-run
the tests if you'd like.

Indexed views can give performance improvements in real SQL databases -
but they can also reduce performance.  I believe the same phenomenon
might be occurring in your in-memory relational models.

I believe that you would gain a huge performance increase by using
Boost.MultiIndex as your underlying data structure (as opposed to
std::vector), and indexing the columns you require.  That is I believe
the fundamental architectural difference between RML and RTL.  By
storing the columns pre-indexed, you don't have any extra overhead.

Regards, Calum