
Here is the equivalent in RML. Forget 3000 rows, what about adding 1000000 rows, then adding another 1000000 to it? What time does the equivalent RTL take?
Calum, I think you are missing my point. The above example was provided to illustrate my two statements:
1) RTL can index any (arbitrary complicated) expression, not just table;
In SQL/RML you can index any arbitrarily complicated expression. If you need to index something complex, you create a column for it. For example if you had a table of people, with (name, height, weight). If you needed to work with bmi = weight/(height^2), you create another column for it. people.insert(Person(name, height, weight, weight/(height*height)); The RTL approach would not need to add the fourth column, and that is an advantage that RTL has over RML. Similarly for self-join. If you wanted to store an all-pairs distance between two cities, you create a table (city1, city2, distance). But in practice, in SQL I have never seen an index on anything other than a column. Occasionally I have seen a multi-column index. So this is why supporting computed indexes has not been a priority for RML.
2) RTL can _incrementally_ update any such index when the underlying table has been changed.
Same in RML. You add data, indexes change automatically. You change a field in a row, indexes update automatically. Probably more efficiently.
Did I use the most efficient expression for this particular task? Of course not. I used cross-product where I would use a range (or merge)
join otherwise. I did it _intentionally_, because I wanted to provide
an analogy of a complicated condition, such condition for which optimization is simply not possible.
I am pretty sure your library figured out how to do this query efficiently, without considering every possible combination (it didn't
consider 10^12 combinations, did it?). However, what it would do if the condition was more complicated? Like calculating the distance, matching people by certain characteristics, etc.? RTL can build an index on even slow expression, after which subsequent queries become a
few orders of magnitude faster. And the indexes are not broken when the tables are changed.
That is a nice idea. One could generalize - have a generic observer on any container. It could turn your entire application data-driven. Nice. I have seen observer-driven architectures work extremely well in practice.
Having said all this, RTL doesn't enforce this kind of usage. If the query needs to be executed only once, or if one doesn't want to consum
memory for indexes, any appropriate relational expression can be built, with any number of indexes in different nodes, to achieve needed tradeof between speed, initial response, memory consumption, etc. My firm belief is that this kind of flexibility is simply impossible with SQL-like approach.
I wouldn't knock the SQL approach - it is an extremely effective paradigm for data query and represention. There are some things that would not be possible in SQL, but one can always work around them. For example maintaining an all-pairs distance between two cities. One can simply declare a table (city1, city2, distance), and index on the columns you wanted. Whereas in RTL this data would be stored implicitly in an index, RML would require an explicit table. The reason RML uses the SQL paradigm is because I know that it works. The RTL view approach would make this much easier to maintain. The question is, what are the overheads of this convenience?
I don't think self-joins come up that often in real life however.
I think they are. I just gave examples where they could be (and are) used. This, however, is not about a self-join. This is about any query
which is slightly more complicated than just a simple index-based selection.
Self-joins don't come up in SQL. The assumption you make is that the query analyser is some kind of black box that will fail in mysterious ways. Actually the rules for query optimization are very well spelt out in the tutorial. It is precisely because the rules for indexing are so straightforward that there will be no unexpected performance penalties in RML. It will pick indexes in any complex multi-table conjunction for any comparators.
Once again, I don't believe this is a right time for any performance contests. They will not prove anything. I think RML will be faster in some cases, and RTL -- in the others. And a couple of days spent on
optimization can dramatically change the picture.
I think you should work on that, and then we can see what's what. The only reason I make a fuss is because you say your approach is more efficient.
What I would like to be discussing instead is the _design principles_ behind the libraries.
Okay, that is a good way forward. RTL has (from what I understand) an observer-driven architecture. Correct me if I am wrong. This means that you don't index the underlying data, you index views of the data. RML on the other hand is a multi-index architecture. The query analyser is basically a way of avoiding writing complex index probes, iteration and filters. From an efficiency point of view, it is exactly the same as writing out C++ to manipulate a multi-index. So, RML has the potential to be more efficient because multi-indexes are more efficient than multiple single-indexes. Best Regards, Calum