Question about proper container/algorithm to use for fast dictionary lookup

Hi, This question is whether Boost already has, or perhaps needs to have a container type that is specifically designed for fast word lookup in a UNICODE dictionary. It has occurred to me that an associative container that is specifically designed for word lookup may be much faster than the standard containers such as map, hash_map, associative_vector as they are general purpose containers. So the more detailed questions are: 1) What type of algorithm/container should be used here? 2) Would its performance be much better than a general purpose associative container? Thanks, Elisha Berns e.berns@computer.org tel. (310) 556 - 8332 fax (310) 556 - 2839

Elisha Berns wrote:
1) What type of algorithm/container should be used here?
You need some kind of a Trie - http://en.wikipedia.org/wiki/Trie
2) Would its performance be much better than a general purpose associative container?
Performance-wise, it should be about the same as a hash_map. Memory-wise, a trie should be much better. AFAIK there is nothing resembling a trie in Boost, but there are other free implementations available.

2) Would its performance be much better than a general purpose associative container?
Performance-wise, it should be about the same as a hash_map. Memory-wise, a trie should be much better.
AFAIK there is nothing resembling a trie in Boost, but there are other free implementations available.
Spirit has a TST as a internal class. It is used in boost::spirit::symbols class. For information about TST see http://www.cs.princeton.edu/%7Ers/strings/ or a recent copy of Sedgewick's "Algorithms" book. Memory-wise a trie could be much worse than a hash_map, it depends on the data set. If enough of your keys are prefixes of one another then memory usage is not that bad. However if your keys are unique or the end of the key is the common part then an excessive amount of memory can be used. Since for a TST a typical node will be at least 16 bytes on a 32bit machine. That is 16 bytes per character, not per string. Nathan E. Moore

Thanks for the reply, Your point is well taken. Well if that's so then why does boost::spirit implement a TST? And just to get a bit more clarity here, at the site for the TST their data set looks much like a pretty random collection of words (a typical dictionary), so I guess the question is which container works best for implementing a) a very small dictionary of under 100 terms that are comprised of the Unicode invariant characters (about 60 characters) which is even a subset of ASCII and b) a larger dictionary of thousands of terms comprised of the entire Unicode BMP set of characters? Thanks, Elisha
2) Would its performance be much better than a general purpose associative container?
Performance-wise, it should be about the same as a hash_map. Memory-wise, a trie should be much better.
AFAIK there is nothing resembling a trie in Boost, but there are
free implementations available.
Spirit has a TST as a internal class. It is used in boost::spirit::symbols class.
For information about TST see http://www.cs.princeton.edu/%7Ers/strings/ or a recent copy of Sedgewick's "Algorithms" book.
Memory-wise a trie could be much worse than a hash_map, it depends on
data set. If enough of your keys are prefixes of one another then memory usage is not that bad. However if your keys are unique or the end of
other the the
key is the common part then an excessive amount of memory can be used. Since for a TST a typical node will be at least 16 bytes on a 32bit machine. That is 16 bytes per character, not per string.
Nathan E. Moore _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
participants (3)
-
Elisha Berns
-
Nathan E. Moore
-
Peter Petrov