== Executive Summary ==
I'm interested in packaging [https://github.com/addthis/stream-lib stream-lib] for Fedora to enable some functionality that is currently disabled in our Apache Spark package (which is under review at the moment). stream-lib incorporates code derived from the Bloom filter implementation in Apache Cassandra, but it has modified this code to make it accessible and consumable by other applications. I believe that this constitutes reverse bundling.
== Standard Questions ==
=== Has the library behaviour been modified? If so, how has it been modified?===
Yes; the interfaces have changed to make the library more testable and suitable for general use. In particular, (1) the stream-lib implementation hides leaky abstractions in the Cassandra implementation and (2) it accepts String objects and byte arrays, whereas the Cassandra implementation accepts only native byte buffers. Unlike the Cassandra implementation, the stream-lib Bloom filter is also serializable, which is essential for distributed computing projects, such as Apache Spark, that use Bloom filters to aggregate over distributed data sets.
I've summarized the interface differences here: https://gist.github.com/willb/9391908
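To make the interface points above concrete, here is a minimal, self-contained sketch of a Bloom filter that accepts both strings and raw bytes and can be serialized and merged. It is an illustration only: the class name `SimpleBloomFilter`, its methods, its wire format, and the SHA-256 double-hashing scheme are all assumptions made for this example and are not the stream-lib or Cassandra code or API.

```python
import hashlib
import struct


class SimpleBloomFilter:
    """Illustrative Bloom filter (hypothetical; not stream-lib's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    @staticmethod
    def _to_bytes(item):
        # Accept text as well as raw bytes, so callers need not build buffers.
        if isinstance(item, str):
            return item.encode("utf-8")
        return bytes(item)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest via double hashing.
        digest = hashlib.sha256(self._to_bytes(item)).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always a definite "absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

    def to_bytes(self):
        # Assumed format for this sketch: two big-endian uint32 parameters,
        # followed by the bit array.
        return struct.pack(">II", self.num_bits, self.num_hashes) + bytes(self.bits)

    @classmethod
    def from_bytes(cls, blob):
        num_bits, num_hashes = struct.unpack(">II", blob[:8])
        restored = cls(num_bits, num_hashes)
        restored.bits = bytearray(blob[8:])
        return restored

    def union(self, other):
        # Merge two filters built with identical parameters (bitwise OR).
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")
        merged = SimpleBloomFilter(self.num_bits, self.num_hashes)
        merged.bits = bytearray(x | y for x, y in zip(self.bits, other.bits))
        return merged
```

In a distributed setting, each worker could build a filter over its own partition, ship `to_bytes()` to a coordinator, and have the coordinator combine the restored filters with `union`. That is the kind of aggregation a non-serializable filter cannot support directly.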
=== Why haven't the changes been pushed to the upstream library?===
Cassandra's implementation is intended for use within Cassandra and not as a standalone library. stream-lib reverse-bundles this code to expose it to other applications. I suspect that Cassandra's Bloom filter prioritizes speed in particular use cases over maintainability as an externally-consumable library.
=== Have the changes been proposed to the Fedora package maintainer for the library?===
Not applicable; Cassandra is not packaged for Fedora.
===Could we make the forked version the canonical version within Fedora?===
It might be possible to port Cassandra to use stream-lib, but I suspect this would have negative performance implications.
===Are the changes useful to consumers other than the bundling application?===
Yes, the point of stream-lib is to make approximate aggregation functionality (including but not limited to Bloom filters) available to other applications.
===Is upstream keeping the base library updated or are they continuously one or more versions behind the latest upstream release?===
They have tracked some, but not all, of the changes to Bloom filter internals from Cassandra. See:
https://github.com/addthis/stream-lib/commits/master/src/main/java/com/clearspring/analytics/stream/membership
vs.
https://github.com/apache/cassandra/commits/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java
The main change they have tracked is an algorithmic improvement, ported over from Cassandra, that lowers the false positive rate. Other changes to the Cassandra code since the fork are more implementation-specific and may not be applicable to the stream-lib implementation.
=== What is the attitude of upstream towards bundling?===
They view this code as a derivative work from the Cassandra implementation. It has diverged over time but has tracked some of the improvements.
=== Overview of the security ramifications of bundling===
Minimal; this code is a data structure with deterministic performance. It doesn't do I/O or execute arbitrary code, and it isn't susceptible to DoS attacks due to malicious inputs. Since Bloom filters are probabilistic data structures that aren't typically used for signatures or other sensitive cryptographic applications[1], algorithmic weaknesses or implementation errors in the underlying hash functions merely degrade accuracy (i.e., error tolerance) rather than expose security vulnerabilities.
[1] There are some research applications of Bloom filters motivated by privacy, e.g., approximate anonymity-maintaining searches of electronic medical records, but again, errors in the filter application would merely make these searches less precise rather than expose sensitive data.
=== Does the maintainer of the Fedora package of the library being bundled have any comments about this?===
=== Is there a plan for unbundling the library at a later time?===
If the Cassandra project were to release their Bloom filter as a reusable, independent component with the usability and interface improvements from the stream-lib implementation, it would be possible to unbundle and base stream-lib upon that with some Fedora-specific patching.
=== Please include any relevant documentation -- mailing list links, bug reports for upstream or the bundled library, etc.===
These are inline in answers to the preceding questions.
We started voting on this at today's meeting but didn't have quorum to approve:
Proposal: stream-lib's use of Cassandra code is considered forking and therefore allowed. (+1:4, 0:0, -1:0)
We talked about this at today's meeting (http://meetbot.fedoraproject.org/fedora-meeting-2/2015-02-12/fpc.2015-02-12-17.05.txt):