
El 22/05/2025 a las 19:36, Vinnie Falco escribió:
On Thu, May 22, 2025 at 10:09 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
In fact I'm somewhat surprised most reviewers haven't asked for variations of/alternatives to Bloom filters. This can be added to the roadmap if the librarygains traction, I guess.
Ideally, variations would have motivating use-cases so their performance could be studied in a real-world setting.
> The example in the README could be improved by declaring the filter thusly: > > using filter = boost::bloom::filter<boost::core::string_view, 5>; > > I don't think it is a good idea to give users an example which performs > unnecessary copying or allocation.
This is not needed cause the filter (with the exception of emplace) does not createelements on its own --you comment on this later in the review.
It does create the element temporarily. Here, we are constructing a `std::string` from `char const*`:
https://github.com/joaquintides/bloom/blob/fabc8698c1053efdfe5336de2d01a4fb3...
This is only because the filter type uses `std::string`. [...]
Ah ok. Yes, a string must be constructed to extract its hash value. boost::bloom::filter supports heterogeneous lookup, so you can avoid it with that, or use std::string_view to beign with as you pointed out.
> I'm puzzled about the terminology. "insert" and "emplace" imply that new > nodes or array elements of type value_type are being stored. This is not > the case though, as these objects are hashed and the resulting digest is > used to algorithmically adjust the filter which is in essence an array of > bool. `emplace` documentation says that it constructs elements.
The interface mimics that of a container as a way to lower the entry barrier for new users learning it. But filters are _not_ containers, as you've found out. Claudio also mentioned this. I'll stress the point in some prominent place inthe documentation.
I have the opposite opinion. That the interface should distance itself from a typical container interface, otherwise new users may mistakenly believe that they behave similarly in terms of resource usage. At least when it comes to strings, proper selection of the template parameter type seems to be a foot-gun. Maybe there is room for a type alias "string_filter" which is tuned for character strings.
I personally don't see the need: if you want to process strings, you have to be aware of the implications of using std::string vs. std::string_view, this is not something germane to candidate Boost.Bloom per se.
BTW, emplace() does construct the element, calculates its hash, and then throwsit away.
Yeah, this is weird. Why even have the element type? `insert` can in theory accept any object which is hashable.
I've been tempted to introduced an insert_hash_value method, but this looks mighty dangerous for the casual user. As an alternative, one can always declare a filter<std::size_t, ...> and do all the hashing stuff externally.
And emplace doesn't seem as valuable for the bloom filter container as it is for regular containers, since there are no nodes. That is, rather than `emplace(args...)` a user could just as easily write `insert(T(args...))`
It's there just for consistency with container APIs. It also supports the incredibly unlikely case of uses-allocator constuction, but this is admittedly not a real motivation: https://en.cppreference.com/w/cpp/memory/uses_allocator
> I think that the array() function(s) need to guarantee that bits beyond the > capacity but present in the storage will be zeroed out. This way they can > be serialized deterministically.
Not sure I'm getting you, but the extent of the span returned by array() is exactlyas wide as capacity() indicates.
What if I construct the filter with say, a prime number of bits like 137? What happens to the last 7 bits (to bring it up to 144)?
You can _request_ 137 bits, but the resulting capacity will be slightly larger than that to accomodate for the subfilter block size and round up to the byte. Joaquin M Lopez Munoz