#23 RFI: Merge identical arch FTIs
Closed 3 years ago by lbrabec. Opened 3 years ago by churchyard.


We were already discussing this, and it honestly is a can of worms, once we get to more complex situations. Merging "the absolutely same ones" would be fairly straightforward, but I'm now trying to put together something that finds "largest common subsets", so we can group the reasons "more better" :) (and yes, my math is weak :D)

Nice. Grouping reasons more better sounds awesome :D

Side note: with traditional libraries, the reasons will usually not be the same on 32-bit and 64-bit, but that at least means there will be 2 groups instead of 6 arches.

(Do you merge on backend or frontend?)

IDK, I'm fairly positive I have "the algorithm" figured out, but would be nice if somebody checked my "math" :D

import collections
import pprint

# toy input: keys stand for arches, values for the FTI reasons on that arch
d1 = dict(
  a = [1, 2, 3, 4, 5],
  b = [4, 5, 6, 7, 8],
  c = [1, 2, 3, 4, 5, 6, 7, 8]
)

pprint.pprint(d1)

# invert the mapping: for each reason, collect the arches it appears on
d2 = collections.defaultdict(list)
for k, v in d1.items():
    for i in v:
        d2[i].append(k)

pprint.pprint(d2)
"""
defaultdict(<class 'list'>,
            {1: ['a', 'c'],
             2: ['a', 'c'],
             3: ['a', 'c'],
             4: ['a', 'b', 'c'],
             5: ['a', 'b', 'c'],
             6: ['b', 'c'],
             7: ['b', 'c'],
             8: ['b', 'c']})
"""

# group the reasons by the (joined) set of arches they share
d3 = collections.defaultdict(list)
for k, v in d2.items():
    d3[', '.join(v)].append(k)

pprint.pprint(d3)
"""
defaultdict(<class 'list'>,
            {'a, b, c': [4, 5],
             'a, c': [1, 2, 3],
             'b, c': [6, 7, 8]})
"""

And I'm also pretty sure this can be condensed down to some map-reduce (so it would be JavaScript and multicore friendly), but I'm way too tired to do that now.
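Roughly this shape, maybe (an untested sketch reusing d1 from the snippet above; the helper names are made up):

from functools import reduce

# "map": emit one (reason, arch) pair per reason
pairs = [(reason, arch) for arch, reasons in d1.items() for reason in reasons]

# first "reduce": fold the pairs into {reason: [arches]}
def collect(acc, pair):
    reason, arch = pair
    acc.setdefault(reason, []).append(arch)
    return acc

# second "reduce": fold that into {'arch1, arch2': [reasons]}
def group(acc, item):
    reason, arches = item
    acc.setdefault(', '.join(arches), []).append(reason)
    return acc

grouped = reduce(group, reduce(collect, pairs, {}).items(), {})
pprint.pprint(grouped)
"""
{'a, b, c': [4, 5], 'a, c': [1, 2, 3], 'b, c': [6, 7, 8]}
"""

Each fold only touches one record at a time, which is the part that should translate to map-reduce.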

I see what you did there. This can, however, give you a "less better" result:

d1 = dict(
  a = [1, 8, 9, 10, 11],
  b = [2, 7, 8],
  c = [3, 7, 9],
  d = [4, 10, 12],
  e = [5, 11],
  f = [6, 12, 8],
)
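
For concreteness, this is what the earlier snippet prints as d3 for that input:

"""
defaultdict(<class 'list'>,
            {'a': [1],
             'a, b, f': [8],
             'a, c': [9],
             'a, d': [10],
             'a, e': [11],
             'b': [2],
             'b, c': [7],
             'c': [3],
             'd': [4],
             'd, f': [12],
             'e': [5],
             'f': [6]})
"""

All six arches end up with a singleton group of their own, and the shared reasons add six more groups, hence the 12.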

This has "only" 6 keys, but the result has 12 (and you can get more if you try harder). Is the idea to use the result only if it has less keys than the input, because for real cases, it's likely to create better results?

This has "only" 6 keys, but the result has 12 (and you can get more if you try harder). Is the idea to use the result only if it has less keys than the input, because for real cases, it's likely to create better results?

It honestly just did not occur to me.

I'm not necessarily sure "number of keys" is the best metric - take this as an example where the "number of keys on output" is bigger, yet the outcome is (IMO) more readable:

d1 = dict(
  a = [1, 2, 3, 4, 5, 6],
  b = [1, 2, 3, 7, 8],
  c = [1, 2, 3, 4, 8],
  d = [1, 2, 4, 7, 8],
)
...
pprint.pprint(d3)
"""
defaultdict(<class 'list'>,
            {'a, b, c, d': [1, 2],
             'a, b, c': [3],
             'a, c, d': [4],
             'b, c, d': [8],
             'b, d': [7],
             'a': [5, 6]})
"""

But I struggle to represent "readability" programmatically. Off the top of my head, I can think of these:
- number of lines produced on the output - aim for at least 25-30% reduction?
- average number of keys-per-section - aim above 2?

Given the code I already shared, this is what I mean:

# proportion of output lines to input lines
# (one line per reason, plus one header line per group/arch)
lines = (sum(len(v) for v in d3.values()) + len(d3)) \
      / (sum(len(v) for v in d1.values()) + len(d1))
# average number of arches per merged group
condensation = sum(len(k.split(', ')) for k in d3) / len(d3)

print("Lines proportion:", lines)
print("Key condensation:", condensation)

Obviously tailored to "suit my example and kill yours": my first example scores (0.52, 2.33), yours (0.96, 1.58).

I think the best way forward would be to test this out on top of the real FTI data, and see what happens, so we can tweak the metrics (or find out that we would not need them anyway).
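E.g. a small helper could bundle the merge plus both metrics, so it can be pointed at a real reasons-per-arch dict (a sketch; the function and argument names are made up):

import collections

def merge_with_metrics(per_arch):
    # per_arch: {arch: [reason, ...]}, e.g. straight from the FTI data
    by_reason = collections.defaultdict(list)
    for arch, reasons in per_arch.items():
        for reason in reasons:
            by_reason[reason].append(arch)
    merged = collections.defaultdict(list)
    for reason, arches in by_reason.items():
        merged[', '.join(arches)].append(reason)
    # the same two metrics as above
    lines = (sum(len(v) for v in merged.values()) + len(merged)) \
          / (sum(len(v) for v in per_arch.values()) + len(per_arch))
    condensation = sum(len(k.split(', ')) for k in merged) / len(merged)
    return merged, lines, condensation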

Also, in the end, it really depends on what you, as a packager, want to be seeing; since I'm not the target audience here, I kind of have a rough time deciding :)

Maybe this whole exercise just does not make sense in the real world, and we'd be better off with just de-duplicating when all the reasons are the same for the two (or more) arches. WDYT?


Right. Let's start with merging the arches if the sets are identical. And we might as well end there.
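I.e. something as simple as grouping arches whose reason sets are exactly equal (a sketch with made-up names, not necessarily what will get deployed):

import collections

def merge_identical(per_arch):
    groups = collections.defaultdict(list)
    for arch in sorted(per_arch):
        # arches with exactly the same set of reasons fall into one group
        groups[frozenset(per_arch[arch])].append(arch)
    return {', '.join(arches): sorted(reasons)
            for reasons, arches in groups.items()}

print(merge_identical({'i686': ['nothing provides foo'],
                       'x86_64': ['nothing provides foo'],
                       'aarch64': ['nothing provides bar']}))
# {'aarch64': ['nothing provides bar'], 'i686, x86_64': ['nothing provides foo']}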

Metadata Update from @lbrabec:
- Issue assigned to lbrabec

3 years ago


Deployed.

Metadata Update from @lbrabec:
- Issue status updated to: Closed (was: Open)

3 years ago


Metadata
Attachments: 2 png files, attached 3 years ago.