Re: [boost] [Bloom] review starts on 13th of May

22 May 2025


On 22/05/2025 at 19:36, Vinnie Falco wrote:
...
On Thu, May 22, 2025 at 10:09 AM Joaquin M López Muñoz via Boost 
<boost@lists.boost.org> wrote:
In fact I'm somewhat surprised most reviewers haven't asked for variations
of/alternatives to Bloom filters. This can be added to the roadmap if
the library gains traction, I guess.
Ideally, variations would have motivating use-cases so their 
performance could be studied in a real-world setting.
> The example in the README could be improved by declaring the filter thusly:
>
>      using filter = boost::bloom::filter<boost::core::string_view, 5>;
>
> I don't think it is a good idea to give users an example which performs
> unnecessary copying or allocation.
This is not needed because the filter (with the exception of emplace) does
not create elements on its own -- you comment on this later in the review.
It does create the element temporarily. Here, we are constructing a 
`std::string` from `char const*`:
https://github.com/joaquintides/bloom/blob/fabc8698c1053efdfe5336de2d01a4fb3...
This is only because the filter type uses `std::string`. [...]
Ah ok. Yes, a string must be constructed to extract its hash value.
boost::bloom::filter supports heterogeneous lookup, so you can avoid it with that,
or use std::string_view to begin with, as you pointed out.
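
To make the allocation point concrete without depending on the candidate library, here is a minimal sketch. `toy_filter`, `may_contain` and `position` are hypothetical names invented for illustration; this is not Boost.Bloom's implementation or interface. It only shows that a filter keyed on `std::string_view` can hash a `char const*` or a `std::string` in place, with no temporary `std::string` being built.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical toy filter, NOT Boost.Bloom: K bit positions are derived
// from a single std::hash value of a std::string_view.
template <std::size_t Bits, std::size_t K>
class toy_filter {
public:
    // Taking string_view means char const* and std::string arguments are
    // viewed in place; no temporary std::string is constructed.
    void insert(std::string_view s) {
        const std::size_t h = std::hash<std::string_view>{}(s);
        for (std::size_t i = 0; i < K; ++i) bits_.set(position(h, i));
    }

    // Never returns false for an inserted element; may return true for
    // an element that was never inserted (false positive).
    bool may_contain(std::string_view s) const {
        const std::size_t h = std::hash<std::string_view>{}(s);
        for (std::size_t i = 0; i < K; ++i)
            if (!bits_.test(position(h, i))) return false;
        return true;
    }

private:
    // Double-hashing style derivation of the i-th bit position.
    static std::size_t position(std::size_t h, std::size_t i) {
        const std::size_t step = (h >> 17) | 1u; // force an odd step
        return (h + i * step) % Bits;
    }

    std::bitset<Bits> bits_;
};

// Usage: a string literal and a std::string go through the same path.
inline bool toy_filter_demo() {
    toy_filter<1024, 5> f;
    f.insert("hello");                // char const* viewed, not copied
    const std::string s = "world";
    f.insert(s);                      // std::string viewed, not copied
    assert(f.may_contain("hello"));
    assert(f.may_contain(s));
    return true;
}
```

Nothing of the inserted value survives except the bits it sets, which is why the choice between `std::string` and `std::string_view` affects only the cost of producing the hash.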
...
> I'm puzzled about the terminology. "insert" and "emplace" imply that new
> nodes or array elements of type value_type are being stored. This is not
> the case though, as these objects are hashed and the resulting digest is
> used to algorithmically adjust the filter which is in essence an array of
> bool. `emplace` documentation says that it constructs elements.
The interface mimics that of a container as a way to lower the entry barrier
for new users learning it. But filters are _not_ containers, as you've found
out. Claudio also mentioned this. I'll stress the point in some
prominent place in the documentation.
I have the opposite opinion. That the interface should distance itself 
from a typical container interface, otherwise new users may mistakenly 
believe that they behave similarly in terms of resource usage. At 
least when it comes to strings, proper selection of the template 
parameter type seems to be a foot-gun. Maybe there is room for a type 
alias "string_filter" which is tuned for character strings.
I personally don't see the need: if you want to process strings, you have to
be aware of the implications of using std::string vs. std::string_view; this is
not something specific to candidate Boost.Bloom per se.
...
BTW, emplace() does construct the element: it calculates its hash and then
throws it away.
Yeah, this is weird. Why even have the element type? `insert` can in 
theory accept any object which is hashable.
I've been tempted to introduce an insert_hash_value method, but this looks
mighty dangerous for the casual user. As an alternative, one can always declare
a filter<std::size_t, ...> and do all the hashing stuff externally.
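
The "hash externally" alternative can be sketched with a toy filter whose element type is the digest itself. Everything below is an illustrative assumption: `digest_filter`, `may_contain`, `position` and the `fnv1a` helper are hypothetical names, and the code stands in for, rather than reproduces, what a `filter<std::size_t, ...>` would do. It also hints at why a raw hash-insertion method is risky: the caller must use exactly the same hash function for insertion and lookup.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical toy, NOT Boost.Bloom: the filter only ever sees digests.
template <std::size_t Bits, std::size_t K>
class digest_filter {
public:
    void insert(std::uint64_t digest) {
        for (std::size_t i = 0; i < K; ++i) bits_.set(position(digest, i));
    }

    bool may_contain(std::uint64_t digest) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits_.test(position(digest, i))) return false;
        return true;
    }

private:
    // Remix the digest per index (splitmix64-style finalizer) so that
    // even low-quality caller digests spread over the bit array.
    static std::size_t position(std::uint64_t d, std::size_t i) {
        std::uint64_t x = d + 0x9e3779b97f4a7c15ULL * (i + 1);
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return static_cast<std::size_t>(x % Bits);
    }

    std::bitset<Bits> bits_;
};

// Caller-owned hashing (64-bit FNV-1a). The same function must be used
// for both insertion and lookup, or lookups will silently miss.
inline std::uint64_t fnv1a(std::string_view s) {
    std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                 // FNV prime
    }
    return h;
}

inline bool digest_filter_demo() {
    digest_filter<4096, 4> f;
    f.insert(fnv1a("boost"));
    assert(f.may_contain(fnv1a("boost")));
    return true;
}
```

With this arrangement no element object is ever constructed by the filter; the responsibility for hashing consistently moves entirely to the caller.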
...
And emplace doesn't seem as valuable for the bloom filter container as 
it is for regular containers, since there are no nodes. That is, 
rather than `emplace(args...)` a user could just as easily write 
`insert(T(args...))`
It's there just for consistency with container APIs. It also supports the incredibly
unlikely case of uses-allocator construction, but this is admittedly not a real
motivation:

https://en.cppreference.com/w/cpp/memory/uses_allocator
...
> I think that the array() function(s) need to guarantee that bits beyond the
> capacity but present in the storage will be zeroed out. This way they can
> be serialized deterministically.
Not sure I'm getting you, but the extent of the span returned by array()
is exactly as wide as capacity() indicates.
What if I construct the filter with say, a prime number of bits like 
137? What happens to the last 7 bits (to bring it up to 144)?
You can _request_ 137 bits, but the resulting capacity will be slightly larger
than that to accommodate the subfilter block size and round up to the byte.
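
As a rough illustration of the rounding described above, the arithmetic can be sketched as follows. This is only an assumption about the general shape of the adjustment, not the library's actual capacity formula; `round_up_bits` is a hypothetical helper and the 64-bit unit is an arbitrary example size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `requested_bits` up to a whole number of
// `unit_bits`-sized units (unit_bits must be nonzero).
constexpr std::size_t round_up_bits(std::size_t requested_bits,
                                    std::size_t unit_bits) {
    return (requested_bits + unit_bits - 1) / unit_bits * unit_bits;
}

inline bool round_up_demo() {
    // Rounding 137 bits up to whole bytes gives the 144 mentioned above.
    assert(round_up_bits(137, 8) == 144);
    // With an assumed 64-bit unit, the same request becomes 192 bits.
    assert(round_up_bits(137, 64) == 192);
    // A request that is already aligned is left unchanged.
    assert(round_up_bits(144, 8) == 144);
    return true;
}
```

Under this reading there are no stray bits between the requested size and the storage size: the reported capacity already covers every bit in the returned span.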

Joaquin M Lopez Munoz