Question about proper container/algorithm to use for fast dictionary lookup

older
BGL - randomize_property & bundled...

Elisha Berns

16 Sep 2005 16 Sep '05

11:56 p.m.

Hi, This question is whether Boost already has, or perhaps needs to have a container type that is specifically designed for fast word lookup in a UNICODE dictionary. It has occurred to me that an associative container that is specifically designed for word lookup may be much faster than the standard containers such as map, hash_map, associative_vector as they are general purpose containers. So the more detailed questions are: 1) What type of algorithm/container should be used here? 2) Would its performance be much better than a general purpose associative container? Thanks, Elisha Berns e.berns@computer.org tel. (310) 556 - 8332 fax (310) 556 - 2839

Show replies by date

Peter Petrov

17 Sep 17 Sep

12:05 a.m.

New subject: Question about proper container/algorithm to use for fast dictionary lookup

Elisha Berns wrote:

...

1) What type of algorithm/container should be used here?

You need some kind of a Trie - http://en.wikipedia.org/wiki/Trie

...

2) Would its performance be much better than a general purpose associative container?

Performance-wise, it should be about the same as a hash_map. Memory-wise, a trie should be much better. AFAIK there is nothing resembling a trie in Boost, but there are other free implementations available.

Nathan E. Moore

6:12 a.m.

New subject: Question about proper container/algorithm to use for fast dictionary lookup

...

...
2) Would its performance be much better than a general purpose associative container?

Performance-wise, it should be about the same as a hash_map. Memory-wise, a trie should be much better.

AFAIK there is nothing resembling a trie in Boost, but there are other free implementations available.

Spirit has a TST as a internal class. It is used in boost::spirit::symbols class. For information about TST see <http://www.cs.princeton.edu/%7Ers/strings/> or a recent copy of Sedgewick's "Algorithms" book. Memory-wise a trie could be much worse than a hash_map, it depends on the data set. If enough of your keys are prefixes of one another then memory usage is not that bad. However if your keys are unique or the end of the key is the common part then an excessive amount of memory can be used. Since for a TST a typical node will be at least 16 bytes on a 32bit machine. That is 16 bytes per character, not per string. Nathan E. Moore

Elisha Berns

18 Sep 18 Sep

4:57 a.m.

New subject: Question about proper container/algorithm to use for fast dictionary lookup

Thanks for the reply, Your point is well taken. Well if that's so then why does boost::spirit implement a TST? And just to get a bit more clarity here, at the site for the TST their data set looks much like a pretty random collection of words (a typical dictionary), so I guess the question is which container works best for implementing a) a very small dictionary of under 100 terms that are comprised of the Unicode invariant characters (about 60 characters) which is even a subset of ASCII and b) a larger dictionary of thousands of terms comprised of the entire Unicode BMP set of characters? Thanks, Elisha

...

...
...
2) Would its performance be much better than a general purpose associative container?

Performance-wise, it should be about the same as a hash_map. Memory-wise, a trie should be much better.

AFAIK there is nothing resembling a trie in Boost, but there are

...

...
free implementations available.

Spirit has a TST as a internal class. It is used in boost::spirit::symbols class.

For information about TST see <http://www.cs.princeton.edu/%7Ers/strings/> or a recent copy of Sedgewick's "Algorithms" book.

Memory-wise a trie could be much worse than a hash_map, it depends on

...

data set. If enough of your keys are prefixes of one another then memory usage is not that bad. However if your keys are unique or the end of

other the the

...

key is the common part then an excessive amount of memory can be used. Since for a TST a typical node will be at least 16 bytes on a 32bit machine. That is 16 bytes per character, not per string.

Nathan E. Moore _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

7226

Age (days ago)

7228

Last active (days ago)

List overview

Download

3 comments

3 participants

participants (3)

Elisha Berns
Nathan E. Moore
Peter Petrov