
Arkadiy Vertleyb wrote
"Calum Grant" <calum@visula.org> wrote
Calum, is it possible to see the latest code of your example?
Here it is.
I see...
A couple of questions:
1) You modify counter all the time -- does RML allow to do this with the fields on which the table is sorted?
Yes - by using table.update_row() or table.update_where(), the indexes are automatically updated. That's the whole point of RML - it uses/updates indexes implicitly. If you decide to index a column, and no code needs to change.
2) When you say :
select(m_locations, m_locations.x>=loc.x-m_cutoff && m_locations.x<=loc.x+m_cutoff && m_locations.y>=loc.y-m_cutoff && m_locations.y<=loc.y+m_cutoff && m_locations.z>=loc.z-m_cutoff && m_locations.z<=loc.z+m_cutoff)
Do you use any kind of special index for this?
Yes - the TMP query analyser sees that m_locations.z is an indexed column, and uses that index. If on the other hand x or y were indexed, that index would be used in preference to z. If I had implemented user-defined queries, I could write select(m_locations, m_locations.x>=loc.x-m_cutoff && m_locations.x<=loc.x+m_cutoff && m_locations.y>=loc.y-m_cutoff && m_locations.y<=loc.y+m_cutoff && m_locations.z>=loc.z-m_cutoff && m_locations.z<=loc.z+m_cutoff && query(test_distance(loc)) ) where test_distance would be some user-defined functor to check the distance more precisely. Even in this scenario, the TMP query analyser would use m_locations.z as an index. It's not magic, it just uses the first index is sees.
In any case this seems to prove that RML does a good job when searching on multiple indexes -- you do quite a bit of that. Also, inserts/updates are probably very efficient.
Insertion carries the same overheads as Boost.MultiIndex. Update is possibly more efficient, since nodes can be relocated in a single index - without needing to destroy/create a new indexed item. This is one of the inefficiencies of std::set or std::map - if you change the key you need to allocate a new node.
But, as I said before, the overall efficiency of this example is based on the fact that this is a very smart, custom, hand-coded solution (see how you maintain the counter). No library would be able to beat its performance using a higher-level abstraction (If RTL had a count() operator in addition to generic groupby(), we could be much smarter in maintaining it during updates).
The RML solution is actually slower because of the incremental updates to the "neighbours" column. Had they been calculated in one go, then the query would I suspect run faster. I would really like to be able to write for_each( limit( sort( (descending(col<2, 0>()), col<0, Location::name>() ), group_by( col<0, Location::id_column>(), select( (locations, locations), col<1, Location::x_column>() <= col<0, Location::x_column>() + m_cutoff && col<1, Location::x_column>() >= col<0, Location::x_column>() - m_cutoff && col<1, Location::y_column>() <= col<0, Location::y_column>() + m_cutoff && col<1, Location::y_column>() >= col<0, Location::y_column>() - m_cutoff && col<1, Location::z_column>() <= col<0, Location::z_column>() + m_cutoff && col<1, Location::z_column>() >= col<0, Location::z_column>() - m_cutoff && query(test_distance()) ), ), ), 50), display_result() ); In order to achieve this, I would need to implement sorting using multiple columns, group_by(), and query(). And this query would I suspect run as fast as the original. (Though it would not support dynamic update). So the fact that I needed to split my query up is nothing to do with the fundamental design of RML. It was because some of the functions were not implemented. I will work towards getting these extra functions implemented, but I have a long todo-list and it won't happen immediately.
But again, I don't believe this all proves RML is faster than RTL, since the coding approaches were so different. I don't even know if it's possible to compare the libraries at this point, since they have their strong points in different areas. RTL's strong point in the ability to build and run SQL-like views. This I tried to show in my example. RML seems to be very efficient at maintaining and using multiple indexes, but, despite its more SQL-like interface, can mimique only very simple SQL queries.
You're only saying that because I haven't got around to implementing group_by and multi-column sorts. These are less commonly used features that were not a priority to implement. But they would not be difficult. When RML can implement the entire query in one statement, we can re-run the tests if you'd like. Indexed views can give performance improvements in real SQL databases - but they can also reduce performance. I believe the same phenomenon might be occurring in your in-memory relational models. I believe that you would gain a huge performance increase by using Boost.MultiIndex as your underlying data structure (as opposed to std::vector), and indexing the columns you require. That is I believe the fundamental architectural difference between RML and RTL. By storing the columns pre-indexed, you don't have any extra overhead. Regards, Calum