== Executive Summary ==
I'm interested in packaging [https://github.com/addthis/stream-lib stream-lib] for Fedora to enable some functionality that is currently disabled in our Apache Spark package (which is under review at the moment). stream-lib incorporates code derived from the Bloom filter implementation in Apache Cassandra, but it has modified this code to make it accessible and consumable by other applications. I believe that this constitutes reverse bundling.
== Standard Questions ==
=== Has the library behaviour been modified? If so, how has it been modified? ===
Yes; the interfaces have changed to make the library more testable and
suitable for general use. In particular, (1) the stream-lib implementation
hides leaky abstractions in the Cassandra implementation and (2) it is
possible to add String objects and byte arrays (not merely native byte
buffers, as in the Cassandra implementation). The stream-lib Bloom filter
is serializable (also unlike the Cassandra implementation), which is an
essential feature for other distributed computing projects that use Bloom
filters to aggregate over distributed data sets (like Apache Spark).
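
To make the last two points concrete, here is a minimal, self-contained Python sketch of a Bloom filter that accepts both strings and byte arrays and can be serialized and merged. It is not the stream-lib or Cassandra code: the class name <code>SketchBloomFilter</code>, its method names, the wire layout, and the SHA-256-based double hashing are all assumptions made purely for illustration.

```python
import hashlib
import struct


class SketchBloomFilter:
    """Toy Bloom filter (hypothetical; not the stream-lib or Cassandra API)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    @staticmethod
    def _to_bytes(key):
        # Accept text as well as raw bytes; text is encoded as UTF-8.
        return key.encode("utf-8") if isinstance(key, str) else bytes(key)

    def _positions(self, key):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(self._to_bytes(key)).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def to_bytes(self):
        # Assumed wire layout: two big-endian uint32 parameters, then the bits.
        return struct.pack(">II", self.num_bits, self.num_hashes) + bytes(self.bits)

    @classmethod
    def from_bytes(cls, payload):
        num_bits, num_hashes = struct.unpack(">II", payload[:8])
        restored = cls(num_bits, num_hashes)
        restored.bits = bytearray(payload[8:])
        return restored

    def merge(self, other):
        # Union of two filters built with identical parameters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share parameters")
        merged = SketchBloomFilter(self.num_bits, self.num_hashes)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged


# One "worker" builds a filter and ships it as bytes; another merges it.
worker_a = SketchBloomFilter()
worker_a.add("alice")            # String key
worker_a.add(b"\x00\x01\x02")    # raw byte-array key
received = SketchBloomFilter.from_bytes(worker_a.to_bytes())

worker_b = SketchBloomFilter()
worker_b.add("bob")
combined = received.merge(worker_b)
```

Being able to ship a filter as bytes and combine it with filters built elsewhere is what makes the structure usable for aggregation across partitions, which is the kind of use described above for projects like Apache Spark.
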
I've summarized the interface differences here:
=== Why haven't the changes been pushed to the upstream library? ===
Cassandra's implementation is intended for use within Cassandra and not as
a standalone library. stream-lib reverse-bundles this code to expose it to
other applications. I suspect that Cassandra's Bloom filter prioritizes
speed in particular use cases over maintainability as an
=== Have the changes been proposed to the Fedora package maintainer for the library? ===
Not applicable; Cassandra is not packaged for Fedora.
=== Could we make the forked version the canonical version within Fedora? ===
It might be possible to port Cassandra to use stream-lib, but I suspect
this would have negative performance implications.
=== Are the changes useful to consumers other than the bundling application? ===
Yes, the point of stream-lib is to make approximate aggregation
functionality (including but not limited to Bloom filters) available to
=== Is upstream keeping the base library updated or are they continuously one or more versions behind the latest upstream release? ===
They have tracked some, but not all, of the changes to Bloom filter
internals from Cassandra. See:
The main change is that they improved the false positive rate using an
algorithmic improvement ported over from Cassandra. Other changes to the
Cassandra code since the fork are more implementation-specific and may not
be applicable to the stream-lib implementation.
=== What is the attitude of upstream towards bundling? ===
They view this code as a derivative work of the Cassandra implementation.
It has diverged over time but has tracked some of the improvements.
=== Overview of the security ramifications of bundling ===
Minimal; this code is a data structure with deterministic performance. It
doesn't do I/O or execute arbitrary code, and it isn't susceptible to DoS
attacks due to malicious inputs. Since Bloom filters are probabilistic data
structures that aren't typically used for signatures or other sensitive
cryptographic applications, algorithmic weaknesses or implementation
errors in the underlying hash functions merely impact performance (i.e.,
error tolerance) rather than expose security vulnerabilities.
(There are some research applications of Bloom filters motivated by
privacy, e.g., approximate anonymity-maintaining searches of electronic
medical records, but again, errors in the filter application would merely
make these searches more imprecise rather than expose sensitive data.)
=== Does the maintainer of the Fedora package of the library being bundled have any comments about this? ===
=== Is there a plan for unbundling the library at a later time? ===
If the Cassandra project were to release their Bloom filter as a reusable,
independent component with the usability and interface improvements from
the stream-lib implementation, it would be possible to unbundle and base
stream-lib upon that with some Fedora-specific patching.
=== Please include any relevant documentation -- mailing list links, bug reports for upstream or the bundled library, etc. ===
These are inline in answers to the preceding questions.
We started voting on this at today's meeting but didn't have quorum to approve:
Proposal: stream-lib's use of Cassandra code is considered forking and therefore allowed. (+1:4, 0:0, -1:0)
We talked about this at today's meeting (http://meetbot.fedoraproject.org/fedora-meeting-2/2015-02-12/fpc.2015-02-12-17.05.txt):
to comment on this ticket.